Run Squirls
Squirls is a command-line Java tool that runs with Java version 11 or higher.
Before using Squirls, you must setup Squirls as describe in the Set up Squirls section.
Squirls provides three commands to annotate variants in different input formats:
annotate-pos
- quickly annotate a couple of variants, e.g.chr9:136224694A>T
annotate-csv
- annotate variants stored in a CSV fileannotate-vcf
- annotate variants in VCF fileprecalculate
- precalculate SQUIRLS scores for provided regions and store the results in a compressed VCF file
In the examples below, we assume that squirls-config.yml
points to a configuration file with correct locations of
Squirls resources.
annotate-pos
- Annotate variant positions
The easiest way to quickly calculate Squirls scores for a couple of variants is to use the annotate-pos
command:
java -jar squirls-cli.jar annotate-pos squirls-config.yml "chr9:136224694A>T" "chr3:52676065CA>C"
The command above generates the following output:
...
2000-01-01 12:34:56.309 INFO 12345 --- [ main] o.m.s.c.c.a.AnnotatePosCommand : Analyzing 2 change(s): `chr9:136224694A>T, chr3:52676065CA>C`
chr9:136224694A>T pathogenic 0.970 ENST00000371964.4=0.970115;ENST00000486887.1=0.970115;ENST00000495524.1=0.970115;NM_001278928.1=0.970115;NM_017503.4=0.970115;uc004cdi.2=0.970115
chr3:52676065CA>C neutral 0.007 ENST00000296302.7=0.007040;ENST00000337303.4=0.007040;ENST00000356770.4=0.007040;ENST00000394830.3=0.007040;ENST00000409057.1=0.007040;ENST00000409114.3=0.007040;ENST00000409767.1=0.007040;ENST00000410007.1=0.007040;ENST00000412587.1=0.007040;ENST00000423351.1=0.007040;ENST00000446103.1=0.007040;NM_018313.4=0.007040;XM_005265275.1=0.007040;XM_005265276.1=0.007040;XM_005265277.1=0.007040;XM_005265278.1=0.007040;XM_005265279.1=0.007040;XM_005265280.1=0.007040;XM_005265281.1=0.007040;XM_005265282.1=0.007040;XM_005265283.1=0.007040;XM_005265284.1=0.007040;XM_005265285.1=0.007040;XM_005265286.1=0.007040;XM_005265287.1=0.007040;XM_005265288.1=0.007040;XM_005265289.1=0.007040;XM_005265290.1=0.007040;XM_005265291.1=0.007040;XM_005265292.1=0.007040;uc003deq.2=0.007040;uc003der.2=0.007040;uc003des.2=0.007040;uc003det.2=0.007040;uc003deu.2=0.007040;uc003dev.2=0.007040;uc003dew.2=0.007040;uc003dex.2=0.007040;uc003dey.2=0.007040;uc003dez.1=0.007040;uc003dfb.1=0.007040;uc010hmk.1=0.007040
Squirls reports scores in four columns:
variant position
variant interpretation, either pathogenic or neutral
maximum Squirls pathogenicity prediction
Squirls pathogenicity predictions calculated for each transcript the variant overlaps with
annotate-csv
- Annotate variant positions stored in a CSV file
To annotate more than just a few variant positions, it may be more convenient to use the annotate-csv
command.
Let’s run the annotate-csv
command to annotate four variants stored in the example.csv file
(an example CSV file with 4 variants stored in Squirls repository):
java -jar squirls-cli.jar annotate-csv squirls-config.yml example.csv output.csv
Squirls will write the scores into output.csv
file:
CHROM,POS,REF,ALT,INTERPRETATION,MAX_SCORE,SCORES
chr9,136224694,A,T,pathogenic,0.9663857211265289,ENST00000371964.4.4=0.966386;ENST00000486887.1.1=0.966386;ENST00000495524.1.1=0.966386;NM_001278928.1=0.966386;NM_017503.4=0.966386;uc004cdi.2=0.966386
chr3,52676065,CA,C,neutral,0.008163212387616258,ENST00000296302.7.7=0.008163;ENST00000337303.4.4=0.008163;ENST00000356770.4.4=0.008163;ENST00000394830.3.3=0.008163;ENST00000409057.1.1=0.008163;ENST00000409114.3.3=0.008163;ENST00000409767.1.1=0.008163;ENST00000410007.1.1=0.008163;ENST00000412587.1.1=0.008163;ENST00000423351.1.1=0.008163;ENST00000446103.1.1=0.008163;NM_018313.4=0.008163;XM_005265275.1=0.008163;XM_005265276.1=0.008163;XM_005265277.1=0.008163;XM_005265278.1=0.008163;XM_005265279.1=0.008163;XM_005265280.1=0.008163;XM_005265281.1=0.008163;XM_005265282.1=0.008163;XM_005265283.1=0.008163;XM_005265284.1=0.008163;XM_005265285.1=0.008163;XM_005265286.1=0.008163;XM_005265287.1=0.008163;XM_005265288.1=0.008163;XM_005265289.1=0.008163;XM_005265290.1=0.008163;XM_005265291.1=0.008163;XM_005265292.1=0.008163;uc003deq.2=0.008163;uc003der.2=0.008163;uc003des.2=0.008163;uc003det.2=0.008163;uc003deu.2=0.008163;uc003dev.2=0.008163;uc003dew.2=0.008163;uc003dex.2=0.008163;uc003dey.2=0.008163;uc003dez.1=0.008163;uc003dfb.1=0.008163;uc010hmk.1=0.008163
chr3,165504107,A,C,pathogenic,0.9999720330487433,ENST00000264381.3.3=0.999972;ENST00000479451.1.1=0.999972;ENST00000482958.1.1=0.999972;ENST00000488954.1.1=0.999972;ENST00000497011.1.1=0.999972;ENST00000540653.1.1=0.999972;NM_000055.2=0.999972;XM_005247685.1=0.999972;uc003fem.4=0.999972;uc003fen.4=0.999972
chr17,41197805,ACATCTGCC,A,neutral,0.010936742107683193,ENST00000309486.4.4=0.010927;ENST00000346315.3.3=0.010927;ENST00000351666.3.3=0.010927;ENST00000352993.3.3=0.010927;ENST00000354071.3.3=0.010927;ENST00000357654.3.3=0.010927;ENST00000461221.1.1=0.010937;ENST00000468300.1.1=0.010927;ENST00000471181.2.2=0.010930;ENST00000491747.2.2=0.010937;ENST00000493795.1.1=0.010930;ENST00000586385.1.1=0.010929;ENST00000591534.1.1=0.010929;ENST00000591849.1.1=0.010929;NM_007294.3=0.010927;NM_007297.3=0.010927;NM_007298.3=0.010927;NM_007299.3=0.010927;NM_007300.3=0.010927;NR_027676.1=0.010927;uc002icp.4=0.010927;uc002icq.3=0.010927;uc002ict.3=0.010927;uc002icu.3=0.010927;uc010cyx.3=0.010927;uc010whl.2=0.010927;uc010whm.2=0.010927;uc010whn.2=0.010927;uc010who.3=0.010927;uc010whp.2=0.010927
Three columns are added into the newly generated output.csv
file:
INTERPRETATION
- variant interpretation, either pathogenic or neutralMAX_SCORE
- maximum Squirls pathogenicity predictionSCORES
- Squirls pathogenicity predictions calculated with respect to all overlapping transcripts, stored in formatTX1=SCORE1;TX2=SCORE2;...;TXn=SCOREn
annotate-vcf
- Annotate variants in a VCF file
The aim of this command is to annotate variants in a VCF file and to store the results in one or more output formats.
Note
Squirls uses Jannovar library under the hood to perform functional variant annotation. Therefore, you must provide a location of the Jannovar transcript database in your system As a convenience, we prepared the databases for UCSC, ENSEMBL, and RefSeq transcripts for download, see Jannovar transcript databases.
To annotate variants in the example.vcf file (an example VCF file with 6 variants stored in Squirls repository), run:
$ java -jar squirls-cli.jar annotate-vcf squirls-config.yml hg19_refseq.ser example.vcf output
After the annotation, the results are stored at output.html
.
Parameters
The annotate-vcf
command requires four positional parameters:
path to Squirls configuration file
path to Jannovar transcript database (indicated as
hg19_refseq.ser
in the example above)path to VCF file with variants
output prefix for the generated files
Options
In addition to parameters, Squirls allows to fine tune the annotation using the following options (optional):
-f, --output-format
- comma separated list of output format descriptors (see below). Usehtml,vcf,vcfgz,csv,tsv
to store results in all output formats. Default:html
-n, --n-variants-to-report
- number of most pathogenic variants to include in HTML report. Default:100
-t, --n-threads
- number of threads to use for variant processing. Default:4
Note
Please note that the options must be specified before the positional parameters
Output formats
The annotate-vcf
command writes results in 4 output formats: HTML, VCF (compressed and uncompressed), CSV, and TSV. Use the -f
option
to select one or more of the desired output formats (e.g. -f html,vcf
).
HTML output format
Without specifying the -f
option, a HTML report containing the 100 most deleterious variants is produced.
The number of the reported variants is adjusted by the -n
option.
See the Result interpretation section for getting more help.
VCF output format
When including vcf
into the -f
option, a VCF file with all input variants is created. The annotation process
adds a novel FILTER and INFO field to each variant that overlaps with at least single transcript region:
SQUIRLS
- a FILTER flag indicating that the variant is considered to have a deleterious effect on >=1 overlapping transcriptSQUIRLS_SCORE
- an INFO string containing SQUIRLS scores for each variant-transcript combination. For an example variantchr1:1234C>A,G
, the field might look like:SQUIRLS_SCORE=A|NM_123456.1=0.988654|ENST00000987654.1=0.988654 SQUIRLS_SCORE=G|NM_12356.1=0.330112|ENST00000987654.1=0.330112
Multiallelic variants are broken down into separate records and processed individually. Predictions with respect to
the overlapping transcripts are separated by a pipe (|
) symbol.
The -n
option has no effect for the VCF output format.
Use vcfgz
instead of vcf
to compress the VCF output (bgzip
) on the fly.
CSV/TSV output format
To write n most deleterious variants into a CSV (or TSV) file, use csv
(tsv
) in the -f
option.
In result, the tabular files with the following columns are created:
chrom |
pos |
ref |
alt |
gene_symbol |
tx_accession |
interpretation |
squirls_score |
---|---|---|---|---|---|---|---|
chr3 |
165504107 |
A |
C |
BCHE |
NM_000055.2 |
pathogenic |
0.99997203304 |
… |
… |
… |
… |
… |
… |
… |
… |
precalculate
- Precalculate SQUIRLS scores
We do not provide a tabular file with precalculated scores for all possible genomic variants. Instead of that, we provide a command for precalculating the scores for your genomic regions of interest. This command precalculates Squirls scores for all possible variants (including INDELs up to specified length) and stores the scores in a compressed VCF file.
Example:
$ java -jar squirls-cli.jar precalculate squirls-config.yml CM000669.1:44187000-44187600 CM000669.1:44186000-44186500
The command computes scores for two regions, each region encompassing an exons of the GCK gene plus some neighboring
intronic sequence. SQUIRLS
recognizes GenBank, RefSeq, UCSC, and simple
(1
, 2
, …, X
, Y
, MT
) contigs accessions.
The region coordinates must be provided using zero-based coordinates where the start position is not part of the region.
By default, SQUIRLS generates all possible SNVs for the bases of the region, including deletion of the base.
For example, a region \(r\) spanning ctg1:3-5
of a 10bp-long reference contig ctg1
:
>ctg1
ACGTACGTAC
yields the variants:
chrom |
pos |
SNVs |
DELs |
INSs |
---|---|---|---|---|
ctg1 |
4 |
|
|
N/A |
ctg1 |
5 |
|
|
N/A |
the annotated variants are stored in a compressed VCF file named squirls-scores.vcf.gz
that is by default stored in
the current working directory.
Please note that the VCF file not sorted. Please sort and index the VCF file yourself, e.g. by running:
bcftools sort squirls-scores.vcf.gz | bgzip -c > squirls-scores.sorted.vcf.gz
tabix squirls-scores.sorted.vcf.gz
Parameters
There is one required positional parameter that is path to Squirls configuration file. Following that, zero or more region
definitions, e.g. CM000669.1:44187000-44187600
, CM000669.1:44186000-44186500
can be provided.
Options
There are several options to adjust:
-i, --input
- path to a BED file with the target regions. Lines starting with#
are ignored. See example regions.bed-o, --output
- path to VCF file where to write the results. The VCF output is compressed, so we recommend to use*.vcf.gz
suffix. (Default:squirls.scores.vcf.gz
)-t, --n-threads
- number of threads to use for calculating the scores. (Default:2
)--individual
- if the flag is present, predictions with respect to all overlapping transcripts will be stored within the INFO field.-l, --length
- maximum length of the generated variants on the reference genome, see Variant generation below (Default:1
)
Parallel processing
When predicting the scores, each region is handled by a single thread, while at most -t
threads being used for
prediction at the same time.
Therefore, to fully leverage the parallelism offered by modern multi-core CPUs, we recommend to split large regions
into several smaller ones.
Variant generation
The default value of the -l, --length
parameter is set to 1
. As explained above, the parameter controls
the length of the generated variants. However, length can be set to any positive integer, leading to calculation
of scores for variants of different lengths.
Using the region \(r\) and the contig ctg1
defined above, setting -l
to 2
will calculate scores for
variants:
chrom |
pos |
SNVs |
DELs |
INSs |
---|---|---|---|---|
ctg1 |
4 |
|
|
|
ctg1 |
5 |
|
|
|
Note
The number of possible variants grows exponentially with increasing of the --length
value. This can lead to
substantial run times and to extending your computational budget. Use at your own risk ;)