Squirls
Super-quick Information Content and Random Forest Learning for Splice Variants.
This application performs prediction of deleteriousness of genomic variants with respect to mRNA splicing.
Set up Squirls
Squirls is a desktop Java application that requires several external files to run. This document explains how to download these files and prepare to run Squirls.
Note: Squirls is written with Java version 11 and will run and compile under Java 11+.
Squirls downloadable resources
There are several external files that must be downloaded prior running Squirls.
Prebuilt Squirls executable
To download the prebuilt Squirls JAR file, go to the Releases section on the Squirls GitHub page and download the latest precompiled version of Squirls.
Squirls database files
Squirls database files are available for download from:
- hg19/GRCh37
Download 2102_hg19 (~10.5 GB for download, ~15 GB unpacked)
- hg38/GRCh38
Download 2102_hg38 (~11.1 GB for download, ~16.5 GB unpacked)
After the download, unzip the archive(s) content into a folder and note the folder path.
Jannovar transcript databases
Functional annotation of variants, which is required for certain Squirls tasks, is performed using Jannovar library.
To run the annotation, Jannovar transcript database files need to be provided. The Jannovar v0.35
database files were
tested to work with Squirls.
For your convenience, the files containing UCSC, RefSeq, or ENSEMBL transcripts for hg19 or hg38 genome assemblies are available for download (~330 MB for download, ~330 MB unpacked).
Build Squirls from source
As an alternative to using prebuilt Squirls JAR file, the Squirls JAR file can also be built from Java sources.
Run the following commands to download Squirls source code from GitHub repository and to build Squirls JAR file:
$ git https://github.com/TheJacksonLaboratory/Squirls
$ cd Squirls
$ ./mvnw package
Note
To build Squirls from sources, JDK 11 or better must be available in the environment
After the successful build, the JAR file is located at squirls-cli/target/squirls-cli-1.0.0.jar
.
To verify that the building process went well, run:
$ java -jar squirls-cli/target/squirls-cli-1.0.0.jar --help
generate-config
- Generate and fill the configuration file
Squirls needs to know about the locations of the external files. The locations are provided in a YAML configuration file.
The command generate-config
generates an empty configuration file:
$ java -jar squirls-cli.jar generate-config squirls-config.yml
The command above generates an empty configuration file squirls-config.yml
in the working directory.
The configuration file has the following content:
# Required properties template, the file follows YAML syntax.
squirls:
# path to directory with Squirls files
data-directory:
# Genome assembly - choose from {hg19, hg38}
genome-assembly:
# Exomiser-like data version (1902 in examples above)
data-version:
# Variant with longer REF allele will not be evaluated
#max-variant-length: 100
#classifier:
# Which classifier to use
#version: v0.4.6
#annotator:
# Which splicing annotator to use
# version: agez
Mandatory parameters
Open the file in your favorite text editor and provide the following three bits of information:
squirls.data-directory
- location the the folder with Squirls data. The directory is expected to have a structure like:squirls_folder |- 1902_hg19: | |- 1902_hg19.fa | |- 1902_hg19.fa.dict | |- 1902_hg19.fa.fai | |- 1902_hg19.phylop.bw | \- 1902_hg19_splicing.mv.db \- 1902_hg38 |- 1902_hg38.fa ...
where
1902_hg19
,1902_hg38
correspond to content of the ZIP files downloaded in the previous sectionsquirls.genome-assembly
- which genome assembly to use, choose from{hg19, hg38}
squirls.data-version
- which data version to use, the data version corresponds to1902
in the example above
Optional parameters
squirls.max-variant-length
- set the maximal length of the variant to be analyzed (100 bp
by default)
Run Squirls
Squirls is a command-line Java tool that runs with Java version 11 or higher.
Before using Squirls, you must setup Squirls as describe in the Set up Squirls section.
Squirls provides three commands to annotate variants in different input formats:
annotate-pos
- quickly annotate a couple of variants, e.g.chr9:136224694A>T
annotate-csv
- annotate variants stored in a CSV fileannotate-vcf
- annotate variants in VCF fileprecalculate
- precalculate SQUIRLS scores for provided regions and store the results in a compressed VCF file
In the examples below, we assume that squirls-config.yml
points to a configuration file with correct locations of
Squirls resources.
annotate-pos
- Annotate variant positions
The easiest way to quickly calculate Squirls scores for a couple of variants is to use the annotate-pos
command:
java -jar squirls-cli.jar annotate-pos squirls-config.yml "chr9:136224694A>T" "chr3:52676065CA>C"
The command above generates the following output:
...
2000-01-01 12:34:56.309 INFO 12345 --- [ main] o.m.s.c.c.a.AnnotatePosCommand : Analyzing 2 change(s): `chr9:136224694A>T, chr3:52676065CA>C`
chr9:136224694A>T pathogenic 0.970 ENST00000371964.4=0.970115;ENST00000486887.1=0.970115;ENST00000495524.1=0.970115;NM_001278928.1=0.970115;NM_017503.4=0.970115;uc004cdi.2=0.970115
chr3:52676065CA>C neutral 0.007 ENST00000296302.7=0.007040;ENST00000337303.4=0.007040;ENST00000356770.4=0.007040;ENST00000394830.3=0.007040;ENST00000409057.1=0.007040;ENST00000409114.3=0.007040;ENST00000409767.1=0.007040;ENST00000410007.1=0.007040;ENST00000412587.1=0.007040;ENST00000423351.1=0.007040;ENST00000446103.1=0.007040;NM_018313.4=0.007040;XM_005265275.1=0.007040;XM_005265276.1=0.007040;XM_005265277.1=0.007040;XM_005265278.1=0.007040;XM_005265279.1=0.007040;XM_005265280.1=0.007040;XM_005265281.1=0.007040;XM_005265282.1=0.007040;XM_005265283.1=0.007040;XM_005265284.1=0.007040;XM_005265285.1=0.007040;XM_005265286.1=0.007040;XM_005265287.1=0.007040;XM_005265288.1=0.007040;XM_005265289.1=0.007040;XM_005265290.1=0.007040;XM_005265291.1=0.007040;XM_005265292.1=0.007040;uc003deq.2=0.007040;uc003der.2=0.007040;uc003des.2=0.007040;uc003det.2=0.007040;uc003deu.2=0.007040;uc003dev.2=0.007040;uc003dew.2=0.007040;uc003dex.2=0.007040;uc003dey.2=0.007040;uc003dez.1=0.007040;uc003dfb.1=0.007040;uc010hmk.1=0.007040
Squirls reports scores in four columns:
variant position
variant interpretation, either pathogenic or neutral
maximum Squirls pathogenicity prediction
Squirls pathogenicity predictions calculated for each transcript the variant overlaps with
annotate-csv
- Annotate variant positions stored in a CSV file
To annotate more than just a few variant positions, it may be more convenient to use the annotate-csv
command.
Let’s run the annotate-csv
command to annotate four variants stored in the example.csv file
(an example CSV file with 4 variants stored in Squirls repository):
java -jar squirls-cli.jar annotate-csv squirls-config.yml example.csv output.csv
Squirls will write the scores into output.csv
file:
CHROM,POS,REF,ALT,INTERPRETATION,MAX_SCORE,SCORES
chr9,136224694,A,T,pathogenic,0.9663857211265289,ENST00000371964.4.4=0.966386;ENST00000486887.1.1=0.966386;ENST00000495524.1.1=0.966386;NM_001278928.1=0.966386;NM_017503.4=0.966386;uc004cdi.2=0.966386
chr3,52676065,CA,C,neutral,0.008163212387616258,ENST00000296302.7.7=0.008163;ENST00000337303.4.4=0.008163;ENST00000356770.4.4=0.008163;ENST00000394830.3.3=0.008163;ENST00000409057.1.1=0.008163;ENST00000409114.3.3=0.008163;ENST00000409767.1.1=0.008163;ENST00000410007.1.1=0.008163;ENST00000412587.1.1=0.008163;ENST00000423351.1.1=0.008163;ENST00000446103.1.1=0.008163;NM_018313.4=0.008163;XM_005265275.1=0.008163;XM_005265276.1=0.008163;XM_005265277.1=0.008163;XM_005265278.1=0.008163;XM_005265279.1=0.008163;XM_005265280.1=0.008163;XM_005265281.1=0.008163;XM_005265282.1=0.008163;XM_005265283.1=0.008163;XM_005265284.1=0.008163;XM_005265285.1=0.008163;XM_005265286.1=0.008163;XM_005265287.1=0.008163;XM_005265288.1=0.008163;XM_005265289.1=0.008163;XM_005265290.1=0.008163;XM_005265291.1=0.008163;XM_005265292.1=0.008163;uc003deq.2=0.008163;uc003der.2=0.008163;uc003des.2=0.008163;uc003det.2=0.008163;uc003deu.2=0.008163;uc003dev.2=0.008163;uc003dew.2=0.008163;uc003dex.2=0.008163;uc003dey.2=0.008163;uc003dez.1=0.008163;uc003dfb.1=0.008163;uc010hmk.1=0.008163
chr3,165504107,A,C,pathogenic,0.9999720330487433,ENST00000264381.3.3=0.999972;ENST00000479451.1.1=0.999972;ENST00000482958.1.1=0.999972;ENST00000488954.1.1=0.999972;ENST00000497011.1.1=0.999972;ENST00000540653.1.1=0.999972;NM_000055.2=0.999972;XM_005247685.1=0.999972;uc003fem.4=0.999972;uc003fen.4=0.999972
chr17,41197805,ACATCTGCC,A,neutral,0.010936742107683193,ENST00000309486.4.4=0.010927;ENST00000346315.3.3=0.010927;ENST00000351666.3.3=0.010927;ENST00000352993.3.3=0.010927;ENST00000354071.3.3=0.010927;ENST00000357654.3.3=0.010927;ENST00000461221.1.1=0.010937;ENST00000468300.1.1=0.010927;ENST00000471181.2.2=0.010930;ENST00000491747.2.2=0.010937;ENST00000493795.1.1=0.010930;ENST00000586385.1.1=0.010929;ENST00000591534.1.1=0.010929;ENST00000591849.1.1=0.010929;NM_007294.3=0.010927;NM_007297.3=0.010927;NM_007298.3=0.010927;NM_007299.3=0.010927;NM_007300.3=0.010927;NR_027676.1=0.010927;uc002icp.4=0.010927;uc002icq.3=0.010927;uc002ict.3=0.010927;uc002icu.3=0.010927;uc010cyx.3=0.010927;uc010whl.2=0.010927;uc010whm.2=0.010927;uc010whn.2=0.010927;uc010who.3=0.010927;uc010whp.2=0.010927
Three columns are added into the newly generated output.csv
file:
INTERPRETATION
- variant interpretation, either pathogenic or neutralMAX_SCORE
- maximum Squirls pathogenicity predictionSCORES
- Squirls pathogenicity predictions calculated with respect to all overlapping transcripts, stored in formatTX1=SCORE1;TX2=SCORE2;...;TXn=SCOREn
annotate-vcf
- Annotate variants in a VCF file
The aim of this command is to annotate variants in a VCF file and to store the results in one or more output formats.
Note
Squirls uses Jannovar library under the hood to perform functional variant annotation. Therefore, you must provide a location of the Jannovar transcript database in your system As a convenience, we prepared the databases for UCSC, ENSEMBL, and RefSeq transcripts for download, see Jannovar transcript databases.
To annotate variants in the example.vcf file (an example VCF file with 6 variants stored in Squirls repository), run:
$ java -jar squirls-cli.jar annotate-vcf squirls-config.yml hg19_refseq.ser example.vcf output
After the annotation, the results are stored at output.html
.
Parameters
The annotate-vcf
command requires four positional parameters:
path to Squirls configuration file
path to Jannovar transcript database (indicated as
hg19_refseq.ser
in the example above)path to VCF file with variants
output prefix for the generated files
Options
In addition to parameters, Squirls allows to fine tune the annotation using the following options (optional):
-f, --output-format
- comma separated list of output format descriptors (see below). Usehtml,vcf,vcfgz,csv,tsv
to store results in all output formats. Default:html
-n, --n-variants-to-report
- number of most pathogenic variants to include in HTML report. Default:100
-t, --n-threads
- number of threads to use for variant processing. Default:4
Note
Please note that the options must be specified before the positional parameters
Output formats
The annotate-vcf
command writes results in 4 output formats: HTML, VCF (compressed and uncompressed), CSV, and TSV. Use the -f
option
to select one or more of the desired output formats (e.g. -f html,vcf
).
HTML output format
Without specifying the -f
option, a HTML report containing the 100 most deleterious variants is produced.
The number of the reported variants is adjusted by the -n
option.
See the Result interpretation section for getting more help.
VCF output format
When including vcf
into the -f
option, a VCF file with all input variants is created. The annotation process
adds a novel FILTER and INFO field to each variant that overlaps with at least single transcript region:
SQUIRLS
- a FILTER flag indicating that the variant is considered to have a deleterious effect on >=1 overlapping transcriptSQUIRLS_SCORE
- an INFO string containing SQUIRLS scores for each variant-transcript combination. For an example variantchr1:1234C>A,G
, the field might look like:SQUIRLS_SCORE=A|NM_123456.1=0.988654|ENST00000987654.1=0.988654 SQUIRLS_SCORE=G|NM_12356.1=0.330112|ENST00000987654.1=0.330112
Multiallelic variants are broken down into separate records and processed individually. Predictions with respect to
the overlapping transcripts are separated by a pipe (|
) symbol.
The -n
option has no effect for the VCF output format.
Use vcfgz
instead of vcf
to compress the VCF output (bgzip
) on the fly.
CSV/TSV output format
To write n most deleterious variants into a CSV (or TSV) file, use csv
(tsv
) in the -f
option.
In result, the tabular files with the following columns are created:
chrom |
pos |
ref |
alt |
gene_symbol |
tx_accession |
interpretation |
squirls_score |
---|---|---|---|---|---|---|---|
chr3 |
165504107 |
A |
C |
BCHE |
NM_000055.2 |
pathogenic |
0.99997203304 |
… |
… |
… |
… |
… |
… |
… |
… |
precalculate
- Precalculate SQUIRLS scores
We do not provide a tabular file with precalculated scores for all possible genomic variants. Instead of that, we provide a command for precalculating the scores for your genomic regions of interest. This command precalculates Squirls scores for all possible variants (including INDELs up to specified length) and stores the scores in a compressed VCF file.
Example:
$ java -jar squirls-cli.jar precalculate squirls-config.yml CM000669.1:44187000-44187600 CM000669.1:44186000-44186500
The command computes scores for two regions, each region encompassing an exons of the GCK gene plus some neighboring
intronic sequence. SQUIRLS
recognizes GenBank, RefSeq, UCSC, and simple
(1
, 2
, …, X
, Y
, MT
) contigs accessions.
The region coordinates must be provided using zero-based coordinates where the start position is not part of the region.
By default, SQUIRLS generates all possible SNVs for the bases of the region, including deletion of the base.
For example, a region \(r\) spanning ctg1:3-5
of a 10bp-long reference contig ctg1
:
>ctg1
ACGTACGTAC
yields the variants:
chrom |
pos |
SNVs |
DELs |
INSs |
---|---|---|---|---|
ctg1 |
4 |
|
|
N/A |
ctg1 |
5 |
|
|
N/A |
the annotated variants are stored in a compressed VCF file named squirls-scores.vcf.gz
that is by default stored in
the current working directory.
Please note that the VCF file not sorted. Please sort and index the VCF file yourself, e.g. by running:
bcftools sort squirls-scores.vcf.gz | bgzip -c > squirls-scores.sorted.vcf.gz
tabix squirls-scores.sorted.vcf.gz
Parameters
There is one required positional parameter that is path to Squirls configuration file. Following that, zero or more region
definitions, e.g. CM000669.1:44187000-44187600
, CM000669.1:44186000-44186500
can be provided.
Options
There are several options to adjust:
-i, --input
- path to a BED file with the target regions. Lines starting with#
are ignored. See example regions.bed-o, --output
- path to VCF file where to write the results. The VCF output is compressed, so we recommend to use*.vcf.gz
suffix. (Default:squirls.scores.vcf.gz
)-t, --n-threads
- number of threads to use for calculating the scores. (Default:2
)--individual
- if the flag is present, predictions with respect to all overlapping transcripts will be stored within the INFO field.-l, --length
- maximum length of the generated variants on the reference genome, see Variant generation below (Default:1
)
Parallel processing
When predicting the scores, each region is handled by a single thread, while at most -t
threads being used for
prediction at the same time.
Therefore, to fully leverage the parallelism offered by modern multi-core CPUs, we recommend to split large regions
into several smaller ones.
Variant generation
The default value of the -l, --length
parameter is set to 1
. As explained above, the parameter controls
the length of the generated variants. However, length can be set to any positive integer, leading to calculation
of scores for variants of different lengths.
Using the region \(r\) and the contig ctg1
defined above, setting -l
to 2
will calculate scores for
variants:
chrom |
pos |
SNVs |
DELs |
INSs |
---|---|---|---|---|
ctg1 |
4 |
|
|
|
ctg1 |
5 |
|
|
|
Note
The number of possible variants grows exponentially with increasing of the --length
value. This can lead to
substantial run times and to extending your computational budget. Use at your own risk ;)
Result interpretation
When designing Squirls, our motivation was to create an interpretable algorithm for identification of splice deleterious variants. We addressed this goal by limiting ourselves to use a small set of biologically interpretable attributes for learning how to separate splice deleterious variants from neutral polymorphisms.
To help with interpretation of a variants that has been marked as splicing deleterious, we developed HTML result format that presents all available information in a visually attractive way. When reporting variants, we sort the variants by Squirls score in descending order - the most deleterious variants are placed on the top of the list.
The following picture shows an example output for variant NM_000251.2:c.1915C>T (chr2:47702319C>T), predicted to create a novel cryptic donor site (click for a full size image):

Squirls summarizes the information available for the variant in a box. The header of the box contains three fields:
Variant coordinates: Summary of variant’s location on used genomic assembly, e.g. chr2:47,702,319 C>T
HGVS gene symbol: e.g. MSH2
Squirls score: maximum predicted splicing pathogenicity score, e.g. 0.698
The box content consists of three sections:
Variant effects on overlapping transcripts: Squirls uses Jannovar to predict effect of the variant on the transcript, and represents the effect using HGVS Sequence Variant Nomenclature. Then, the effects and Squirls scores calculated for the overlapping transcripts are listed in a table. The transcript accessions corresponding to the maximum Squirls score are emphasized by blue color (all 4 rows in the above example).
Splicing features: Squirls shows splicing feature values in a table, see Splice features section for explanations.
Figures: Squirls presents SVG graphics that show the most important predicted effects.
Variant categories
When generating the HTML report, Squirls makes a decision about which set of figures to make for a given variant. Based on the most likely splice altering-pathomechanism, the variant is assigned into one of four categories that dictate which figures will be generated:
Canonical donor
Squirls creates a Sequence trekker for variants that are likely to disrupt a canonical donor site and to lead to either exon skipping or to utilization of a weaker cryptic site located nearby. In addition, Squirls plots the position of \(\Delta R_i\) canonical donor in the distribution of random changes to sequences of the same length (see \Delta R_i score distribution).
Sequence trekker summarizes the sequence context and the impact of the variant on the binding site. Let’s consider
a A>T
variant located at position 4 with respect to exon/intron border. The ref allele represented by normal a
character
is substituted by t
. The change location is highlighted by a black box. The unfavourable contact between
spliceosome and the alt allele is represented by drawing the red bar corresponding to t
upside down, and by
drawing box for the a
of the ref allele upwards:
The distribution of random changes shows position of \(\Delta R_i\) canonical donor of the particular variant in the distribution of random changes to sequence of the same length. When \(\Delta R_i\) score is positive and close to the distribution edge, then the variant reduces the sequence information and the resulting allele is less likely to be recognized as a donor site.
Cryptic donor
For a variant predicted to create a cryptic donor site, we generate Sequence trekkers to compare the candidate site to the closest canonical donor site.
Let’s consider the case of a missense variant chr2:47,702,319C>T
(NM_000251.2: c.1915C>T) reported by
Liu et al., 1994 (Table 2, Kindred JV). The variant is located 91 bases upstream of the canonical donor site and
introduces a cryptic donor site into coding sequence of the MSH2 gene.
Squirls generates a Sequence trekker for the Canonical donor site, the site features \(R_i=6.10\) bits:
The candidate cryptic donor site consists of the following alleles, where (again) the change C>T
is located
91 bases upstream of the canonical donor site:
CAGG
CATGC
(ref)CAGG
TATGC
(alt)
Squirls represents alleles of the Predicted cryptic donor site by the following sequence trekker:
The T
base introduced by the variant increases \(R_i\) of the site by 2.49
bits to \(R_{i\ alt}=8.59\)
bits.
The increase is graphically represented by drawing an upside-down blue box for c
(an unfavorable contact),
and upwards pointing box for t
to represent a favourable interaction between the alt allele and
the spliceosome.
Canonical acceptor
For a variant that is likely to disrupt a canonical acceptor site, we create Sequence trekker, and we plot position of \(\Delta R_i\) canonical acceptor in the distribution of random changes to sequence of the same length (see \Delta R_i score distribution).
Sequence trekker shows relative importance of the individual positions of the acceptor site and the impact of the variant on the site.
We also show position of \(\Delta R_i\) canonical acceptor in the distribution of random changes to sequence of the same length. Here, the \(\Delta R_i\) score will be positive if the variant reduces the sequence information and if the variant is likely to reduce recognition of the acceptor site.
Additionally, the variants that introduce (Y)AG
sequence into the AG-exclusion zone might lead to exon skipping
or to cryptic splicing (see Wimmer et al., 2020). The info regarding violation of the AG-exclusion zone is located
in the splice features table.
Cryptic acceptor
For the variant that leads to creation of a cryptic acceptor site, Squirls generates the same graphics as for the cryptic donor sites - two Sequence trekkers to compare the candidate cryptic acceptor site to the closest canonical acceptor site.
Let’s consider the case of the variant chr1:16,451,824C>T
(NM_004431.3: c.2826-9G>A) located 9 bases upstream of
the canonical acceptor site that introduces a cryptic acceptor site into the EPHA2 gene (Zhang et al., 2009).
The first sequence walker represents the Canonical acceptor site, located 9 bp downstream of the variant site:
The alt allele of the canonical site has \(R_i=7.26\) bits.
Then, the Predicted cryptic acceptor site consists of these alleles:
ctaactctccctctctccctccc
ggCC
(ref)ctaactctccctctctccctccc
agCC
(alt)
The corresponding sequence trekker is:
The cryptic acceptor site features \(R_i = 11.98\) bits. Sequence trekker depicts the change by drawing the
orange box for g
upside down (an unfavorable contact), and by drawing the green box for a
upwards
(a favourable interaction). The changed position is emphasized by a black box on the sequence ruler.
Note
Please note that Squirls uses the alt allele to generate sequences necessary to draw sequence trekkers for both canonical site and cryptic site. This is because we are interested in comparing the sites and not the individual alleles.
Figure types
This section provides detailed explanations of the figures we generate for the variants, as described in the previous section. We consider these figures to be the most helpful for clinical interpretation of the splice variants.
Sequence ruler
Sequence rulers are SVG graphics that show the sequence of the donor or acceptor site, mark the intron-exon boundary (red vertical bar), and show the position of any alternate bases that diverge from the reference sequence (black rectangle).
Note
We intentionally omit the position zero in sequence rulers, to make the result interpretation easier for biologists, who are more comfortable with numbering of intronic/exonic bases that starts at one.
However, please note that the correct numbering scheme starts at zero. Please visit website of professor Tom Schneider, where among Pitfalls in Information Theory he also explains the correct numbering scheme for sequences.
Sequence logo
In 1990, Tom Schneider introduced Sequence logos as a way of graphically displaying consensus sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these sequence logos, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns (Nucleic Acids Res 1990;18:6097-100).
Sequence walker
Tom Schneider introduced Sequence walkers in 1995 as a way of graphically displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo (Nucleic Acids Res 1997;25:4408-15).
In 1998, Peter Rogan introduced the application of individual information content and Sequence walkers to splicing variants (Hum Mutat 1998;12:153-71).
Note
Squirls does not generate Sequence walker graphics for sequences. Instead, Squirls uses Sequence trekker, a graphics based on Sequence walker that is explained in the next section.
Sequence trekker
Squirls combines the sequence ruler, sequence logo, and sequence walker into a new figure that we call Sequence trekker (because a trek goes further than a walk).
On top of that, sequence trekker integrates the information regarding the reference and the alternate alleles into a single graphics.
Sequence trekker replaces the letters used in Sequence walker by bars. The bars are colored using the standard “Sanger” color conventions. Similarly to Sequence walker, the bar orientation indicates favorable (up) or unfavorable (down) contacts. The bar height shows the contribution of that base to the average sequence contribution of the binding site. To present data for reference and alternate alleles in the same time, the bar corresponding to the reference allele at the variant position is drawn with a semi-transparent fill.
In many disease-associated variants, the bar corresponding to the reference base will be positioned upright and the alternate base will be facing down.
\(\Delta R_i\) score distribution
The individual sequence information of a sequence \(R_{i\ ref}\) and an alternate sequence \(R_{i\ alt}\) are presented using the Sequence trekker. This graphic shows the value of the difference between the reference sequence and an alternate sequence as well as the distribution of random changes to sequences of the same length. A variant that reduces the sequence information is associated with a positive \(\Delta R_i\) score (\(\Delta R_i = 8.96\) bits in this case).
Squirls anatomy
This document outlines the anatomy of the Squirls model, specifically, how a Squirls score is calculated for a variant.
As outlined in the Squirls manuscript, Squirls consists of two random forest estimators (one for the donor and the other for the acceptor site) followed by a logistic regression. Both random forests calculate predictions for a single variant, the predictions are subsequently transformed by the logistic regression into the final Squirls score. For a single variant, Squirls calculates scores for all overlapping transcripts.
Splice features
The first step of the prediction process is the calculation of a small set of interpretable numeric features for machine learning. The features are then passed to random forest estimators. The random forests use different feature subsets to perform the prediction.
Donor site-specific estimator
This section lists the features used by the donor random forest estimator:
- \(R_i\) wt donor
Information content (\(R_i\)) of the closest canonical donor site.
- \(\Delta R_i\) canonical donor
Difference between \(R_i\) of ref and alt alleles of the closest donor site (0 bits if the variant does not affect the site).
- \(\Delta R_i\) wt closest donor
Difference between \(R_i\) of the closest donor and the downstream (3’) donor site (0 bits if this is the donor site of the last intron).
- Donor offset
Number of 1 bp-long steps required to pass through the exon/intron border of the closest donor site. The number is negative if the variant is located upstream from the border.
- max \(R_i\) cryptic donor window
Maximum \(R_i\) of sliding window of all 9 bp sequences that contain the alt allele.
- \(\Delta R_i\) cryptic donor
Difference between max \(R_i\) of sliding window of all 9 bp sequences that contain the alt allele and \(R_i\) of alt allele of the closest donor site.
- phyloP
Mean phyloP score of the ref allele region.
Acceptor site-specific estimator
These are the features used by the acceptor random forest estimator:
- \(\Delta R_i\) canonical acceptor
Difference between information content (\(R_i\)) of ref and alt alleles of the closest acceptor site (
0
if the variant does not affect the acceptor site).- \(\Delta R_i\) cryptic acceptor
Difference between max \(R_i\) of sliding window applied to alt allele neighboring sequence and \(R_i\) of alt allele of the closest acceptor site.
- Creates
AG
in AGEZ 1
if the variant creates a novelAG
di-nucleotide in AGEZ,0
otherwise.- Creates
YAG
in AGEZ 1
if the variant creates a novelYAG
tri-nucleotide in AGEZ whereY
stands for a pyrimidine derivative (cytosine or thymine),0
otherwise (see Wimmer et al., 2020).- Acceptor offset
Number of 1 bp-long steps required to pass through the exon/intron border of the closest acceptor site. The number is negative if the variant is located upstream from the border.
- Exon length
Number of nucleotides spanned by the exon where the variant is located in (
-1
for non-coding variants that do not affect the canonical donor/acceptor regions).- ESRSeq
Estimate of impact of random hexamer sequences on splicing efficiency when inserted into five distinct positions of two different minigene exons obtained by in vitro screening (Ke et al., 2011).
- SMS
Estimated splicing efficiency for 7-mer sequences obtained by saturating a model exon with single and double base substitutions (saturation mutagenesis derived splicing score, Ke et al., 2018).
- phyloP
Mean phyloP score of the ref allele region.
Note
The values of all features based on information theory are in bits of information
Random forest estimators
Squirls algorithm consists of two random forest estimators trained to recognize variants that change splicing of a donor or acceptor site. Given a set of splice features, the estimator calculates deleteriousness for the corresponding variant.
If a feature cannot be calculated for a variant, the missing feature value is imputed by a median feature value that was observed during training of the model.
The random forest consists of \(n\) decision trees that use the splice features to make a decision regarding deleteriousness of the variant in question.
Logistic regression
Squirls uses logistic regression as the final step to integrate outputs of the donor and acceptor random forests into the final Squirls score.
Glossary
- AGEZ
AG‐exclusion zone, the sequence between the branch point and the proper 3’ss
AG
that is devoid ofAG
s, as defined by Gooding et al., 2006- Information content
Individual information content of a nucleotide sequence \(R_i(j)\) that is related to thermodynamic entropy and the free energy of binding. \(R_i\) can also be used to compare sites with one another.
Use Squirls as a library
Squirls has been designed to be used as a library within a larger frameworks for prioritization of genome variants. This document explains how to use Squirls API classes to enable including splicing deleteriousness predictions into a Java software.
The following sections describe how to use Squirls as a module in other Java tool.
Install Squirls modules into your local Maven repository
Please run the following command to install Squirls into the local Maven repository:
git checkout https://github.com/TheJacksonLaboratory/Squirls
cd Squirls
./mvnw install
Note
The installation requires JDK11+ to be present on the platform
Spring Boot application
After the successful installation, add the following dependency into your pom.xml
to use Squirls in a Spring boot app:
<dependency>
<groupId>org.monarchinitiative.squirls</groupId>
<artifactId>squirls-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
After adding the dependency, Spring configures VariantSplicingEvaluator
bean, provided that the environment properties
squirls.data-directory
squirls.genome-assembly
squirls.data-version
are set to proper values.
The VariantSplicingEvaluator
provides the following methods to calculate SquirlsResult
for a Variant
specified by
Svart library:
SquirlsResult evaluate(Variant variant, Set<String> txIds);
SquirlsResult evaluate(Variant variant);
Please see the corresponding Javadocs to learn more about VariantSplicingEvaluator
, SquirlsResult
, etc.