Squirls

Super-quick Information Content and Random Forest Learning for Splice Variants.

This application performs prediction of deleteriousness of genomic variants with respect to mRNA splicing.

Set up Squirls

Squirls is a desktop Java application that requires several external files to run. This document explains how to download these files and prepare to run Squirls.

Note: Squirls is written with Java version 11 and will run and compile under Java 11+.

Squirls downloadable resources

There are several external files that must be downloaded prior running Squirls.

Prebuilt Squirls executable

To download the prebuilt Squirls JAR file, go to the Releases section on the Squirls GitHub page and download the latest precompiled version of Squirls.

Squirls database files

Squirls database files are available for download from:

hg19/GRCh37

Download 2102_hg19 (~10.5 GB for download, ~15 GB unpacked)

hg38/GRCh38

Download 2102_hg38 (~11.1 GB for download, ~16.5 GB unpacked)

After the download, unzip the archive(s) content into a folder and note the folder path.

Jannovar transcript databases

Functional annotation of variants, which is required for certain Squirls tasks, is performed using Jannovar library. To run the annotation, Jannovar transcript database files need to be provided. The Jannovar v0.35 database files were tested to work with Squirls.

For your convenience, the files containing UCSC, RefSeq, or ENSEMBL transcripts for hg19 or hg38 genome assemblies are available for download (~330 MB for download, ~330 MB unpacked).

Download Jannovar files from here.

Build Squirls from source

As an alternative to using prebuilt Squirls JAR file, the Squirls JAR file can also be built from Java sources.

Run the following commands to download Squirls source code from GitHub repository and to build Squirls JAR file:

$ git https://github.com/TheJacksonLaboratory/Squirls
$ cd Squirls
$ ./mvnw package

Note

To build Squirls from sources, JDK 11 or better must be available in the environment

After the successful build, the JAR file is located at squirls-cli/target/squirls-cli-1.0.0.jar.

To verify that the building process went well, run:

$ java -jar squirls-cli/target/squirls-cli-1.0.0.jar --help

generate-config - Generate and fill the configuration file

Squirls needs to know about the locations of the external files. The locations are provided in a YAML configuration file. The command generate-config generates an empty configuration file:

$ java -jar squirls-cli.jar generate-config squirls-config.yml

The command above generates an empty configuration file squirls-config.yml in the working directory.

The configuration file has the following content:

# Required properties template, the file follows YAML syntax.
squirls:
  # path to directory with Squirls files
  data-directory:
  # Genome assembly - choose from {hg19, hg38}
  genome-assembly:
  # Exomiser-like data version (1902 in examples above)
  data-version:

  # Variant with longer REF allele will not be evaluated
  #max-variant-length: 100

  #classifier:
  # Which classifier to use
  #version: v0.4.6
  #annotator:
  # Which splicing annotator to use
  #  version: agez

Mandatory parameters

Open the file in your favorite text editor and provide the following three bits of information:

  • squirls.data-directory - location the the folder with Squirls data. The directory is expected to have a structure like:

    squirls_folder
       |- 1902_hg19:
       |   |- 1902_hg19.fa
       |   |- 1902_hg19.fa.dict
       |   |- 1902_hg19.fa.fai
       |   |- 1902_hg19.phylop.bw
       |   \- 1902_hg19_splicing.mv.db
       \- 1902_hg38
           |- 1902_hg38.fa
           ...
    

    where 1902_hg19, 1902_hg38 correspond to content of the ZIP files downloaded in the previous section

  • squirls.genome-assembly - which genome assembly to use, choose from {hg19, hg38}

  • squirls.data-version - which data version to use, the data version corresponds to 1902 in the example above

Optional parameters

  • squirls.max-variant-length - set the maximal length of the variant to be analyzed (100 bp by default)

Run Squirls

Squirls is a command-line Java tool that runs with Java version 11 or higher.

Before using Squirls, you must setup Squirls as describe in the Set up Squirls section.

Squirls provides three commands to annotate variants in different input formats:

  • annotate-pos - quickly annotate a couple of variants, e.g. chr9:136224694A>T

  • annotate-csv - annotate variants stored in a CSV file

  • annotate-vcf - annotate variants in VCF file

  • precalculate - precalculate SQUIRLS scores for provided regions and store the results in a compressed VCF file

In the examples below, we assume that squirls-config.yml points to a configuration file with correct locations of Squirls resources.

annotate-pos - Annotate variant positions

The easiest way to quickly calculate Squirls scores for a couple of variants is to use the annotate-pos command:

java -jar squirls-cli.jar annotate-pos squirls-config.yml "chr9:136224694A>T" "chr3:52676065CA>C"

The command above generates the following output:

...
2000-01-01 12:34:56.309  INFO 12345 --- [           main] o.m.s.c.c.a.AnnotatePosCommand           : Analyzing 2 change(s): `chr9:136224694A>T, chr3:52676065CA>C`

chr9:136224694A>T    pathogenic    0.970    ENST00000371964.4=0.970115;ENST00000486887.1=0.970115;ENST00000495524.1=0.970115;NM_001278928.1=0.970115;NM_017503.4=0.970115;uc004cdi.2=0.970115
chr3:52676065CA>C    neutral       0.007    ENST00000296302.7=0.007040;ENST00000337303.4=0.007040;ENST00000356770.4=0.007040;ENST00000394830.3=0.007040;ENST00000409057.1=0.007040;ENST00000409114.3=0.007040;ENST00000409767.1=0.007040;ENST00000410007.1=0.007040;ENST00000412587.1=0.007040;ENST00000423351.1=0.007040;ENST00000446103.1=0.007040;NM_018313.4=0.007040;XM_005265275.1=0.007040;XM_005265276.1=0.007040;XM_005265277.1=0.007040;XM_005265278.1=0.007040;XM_005265279.1=0.007040;XM_005265280.1=0.007040;XM_005265281.1=0.007040;XM_005265282.1=0.007040;XM_005265283.1=0.007040;XM_005265284.1=0.007040;XM_005265285.1=0.007040;XM_005265286.1=0.007040;XM_005265287.1=0.007040;XM_005265288.1=0.007040;XM_005265289.1=0.007040;XM_005265290.1=0.007040;XM_005265291.1=0.007040;XM_005265292.1=0.007040;uc003deq.2=0.007040;uc003der.2=0.007040;uc003des.2=0.007040;uc003det.2=0.007040;uc003deu.2=0.007040;uc003dev.2=0.007040;uc003dew.2=0.007040;uc003dex.2=0.007040;uc003dey.2=0.007040;uc003dez.1=0.007040;uc003dfb.1=0.007040;uc010hmk.1=0.007040

Squirls reports scores in four columns:

  • variant position

  • variant interpretation, either pathogenic or neutral

  • maximum Squirls pathogenicity prediction

  • Squirls pathogenicity predictions calculated for each transcript the variant overlaps with

annotate-csv - Annotate variant positions stored in a CSV file

To annotate more than just a few variant positions, it may be more convenient to use the annotate-csv command.

Let’s run the annotate-csv command to annotate four variants stored in the example.csv file (an example CSV file with 4 variants stored in Squirls repository):

java -jar squirls-cli.jar annotate-csv squirls-config.yml example.csv output.csv

Squirls will write the scores into output.csv file:

CHROM,POS,REF,ALT,INTERPRETATION,MAX_SCORE,SCORES
chr9,136224694,A,T,pathogenic,0.9663857211265289,ENST00000371964.4.4=0.966386;ENST00000486887.1.1=0.966386;ENST00000495524.1.1=0.966386;NM_001278928.1=0.966386;NM_017503.4=0.966386;uc004cdi.2=0.966386
chr3,52676065,CA,C,neutral,0.008163212387616258,ENST00000296302.7.7=0.008163;ENST00000337303.4.4=0.008163;ENST00000356770.4.4=0.008163;ENST00000394830.3.3=0.008163;ENST00000409057.1.1=0.008163;ENST00000409114.3.3=0.008163;ENST00000409767.1.1=0.008163;ENST00000410007.1.1=0.008163;ENST00000412587.1.1=0.008163;ENST00000423351.1.1=0.008163;ENST00000446103.1.1=0.008163;NM_018313.4=0.008163;XM_005265275.1=0.008163;XM_005265276.1=0.008163;XM_005265277.1=0.008163;XM_005265278.1=0.008163;XM_005265279.1=0.008163;XM_005265280.1=0.008163;XM_005265281.1=0.008163;XM_005265282.1=0.008163;XM_005265283.1=0.008163;XM_005265284.1=0.008163;XM_005265285.1=0.008163;XM_005265286.1=0.008163;XM_005265287.1=0.008163;XM_005265288.1=0.008163;XM_005265289.1=0.008163;XM_005265290.1=0.008163;XM_005265291.1=0.008163;XM_005265292.1=0.008163;uc003deq.2=0.008163;uc003der.2=0.008163;uc003des.2=0.008163;uc003det.2=0.008163;uc003deu.2=0.008163;uc003dev.2=0.008163;uc003dew.2=0.008163;uc003dex.2=0.008163;uc003dey.2=0.008163;uc003dez.1=0.008163;uc003dfb.1=0.008163;uc010hmk.1=0.008163
chr3,165504107,A,C,pathogenic,0.9999720330487433,ENST00000264381.3.3=0.999972;ENST00000479451.1.1=0.999972;ENST00000482958.1.1=0.999972;ENST00000488954.1.1=0.999972;ENST00000497011.1.1=0.999972;ENST00000540653.1.1=0.999972;NM_000055.2=0.999972;XM_005247685.1=0.999972;uc003fem.4=0.999972;uc003fen.4=0.999972
chr17,41197805,ACATCTGCC,A,neutral,0.010936742107683193,ENST00000309486.4.4=0.010927;ENST00000346315.3.3=0.010927;ENST00000351666.3.3=0.010927;ENST00000352993.3.3=0.010927;ENST00000354071.3.3=0.010927;ENST00000357654.3.3=0.010927;ENST00000461221.1.1=0.010937;ENST00000468300.1.1=0.010927;ENST00000471181.2.2=0.010930;ENST00000491747.2.2=0.010937;ENST00000493795.1.1=0.010930;ENST00000586385.1.1=0.010929;ENST00000591534.1.1=0.010929;ENST00000591849.1.1=0.010929;NM_007294.3=0.010927;NM_007297.3=0.010927;NM_007298.3=0.010927;NM_007299.3=0.010927;NM_007300.3=0.010927;NR_027676.1=0.010927;uc002icp.4=0.010927;uc002icq.3=0.010927;uc002ict.3=0.010927;uc002icu.3=0.010927;uc010cyx.3=0.010927;uc010whl.2=0.010927;uc010whm.2=0.010927;uc010whn.2=0.010927;uc010who.3=0.010927;uc010whp.2=0.010927

Three columns are added into the newly generated output.csv file:

  • INTERPRETATION - variant interpretation, either pathogenic or neutral

  • MAX_SCORE - maximum Squirls pathogenicity prediction

  • SCORES - Squirls pathogenicity predictions calculated with respect to all overlapping transcripts, stored in format TX1=SCORE1;TX2=SCORE2;...;TXn=SCOREn

annotate-vcf - Annotate variants in a VCF file

The aim of this command is to annotate variants in a VCF file and to store the results in one or more output formats.

Note

Squirls uses Jannovar library under the hood to perform functional variant annotation. Therefore, you must provide a location of the Jannovar transcript database in your system As a convenience, we prepared the databases for UCSC, ENSEMBL, and RefSeq transcripts for download, see Jannovar transcript databases.

To annotate variants in the example.vcf file (an example VCF file with 6 variants stored in Squirls repository), run:

$ java -jar squirls-cli.jar annotate-vcf squirls-config.yml hg19_refseq.ser example.vcf output

After the annotation, the results are stored at output.html.

Parameters

The annotate-vcf command requires four positional parameters:

  • path to Squirls configuration file

  • path to Jannovar transcript database (indicated as hg19_refseq.ser in the example above)

  • path to VCF file with variants

  • output prefix for the generated files

Options

In addition to parameters, Squirls allows to fine tune the annotation using the following options (optional):

  • -f, --output-format - comma separated list of output format descriptors (see below). Use html,vcf,vcfgz,csv,tsv to store results in all output formats. Default: html

  • -n, --n-variants-to-report - number of most pathogenic variants to include in HTML report. Default: 100

  • -t, --n-threads - number of threads to use for variant processing. Default: 4

Note

Please note that the options must be specified before the positional parameters

Output formats

The annotate-vcf command writes results in 4 output formats: HTML, VCF (compressed and uncompressed), CSV, and TSV. Use the -f option to select one or more of the desired output formats (e.g. -f html,vcf).

HTML output format

Without specifying the -f option, a HTML report containing the 100 most deleterious variants is produced. The number of the reported variants is adjusted by the -n option.

See the Result interpretation section for getting more help.

VCF output format

When including vcf into the -f option, a VCF file with all input variants is created. The annotation process adds a novel FILTER and INFO field to each variant that overlaps with at least single transcript region:

  • SQUIRLS - a FILTER flag indicating that the variant is considered to have a deleterious effect on >=1 overlapping transcript

  • SQUIRLS_SCORE - an INFO string containing SQUIRLS scores for each variant-transcript combination. For an example variant chr1:1234C>A,G, the field might look like:

    SQUIRLS_SCORE=A|NM_123456.1=0.988654|ENST00000987654.1=0.988654
    SQUIRLS_SCORE=G|NM_12356.1=0.330112|ENST00000987654.1=0.330112
    

Multiallelic variants are broken down into separate records and processed individually. Predictions with respect to the overlapping transcripts are separated by a pipe (|) symbol.

The -n option has no effect for the VCF output format.

Use vcfgz instead of vcf to compress the VCF output (bgzip) on the fly.

CSV/TSV output format

To write n most deleterious variants into a CSV (or TSV) file, use csv (tsv) in the -f option.

In result, the tabular files with the following columns are created:

Tabular output

chrom

pos

ref

alt

gene_symbol

tx_accession

interpretation

squirls_score

chr3

165504107

A

C

BCHE

NM_000055.2

pathogenic

0.99997203304

precalculate - Precalculate SQUIRLS scores

We do not provide a tabular file with precalculated scores for all possible genomic variants. Instead of that, we provide a command for precalculating the scores for your genomic regions of interest. This command precalculates Squirls scores for all possible variants (including INDELs up to specified length) and stores the scores in a compressed VCF file.

Example:

$ java -jar squirls-cli.jar precalculate squirls-config.yml CM000669.1:44187000-44187600 CM000669.1:44186000-44186500

The command computes scores for two regions, each region encompassing an exons of the GCK gene plus some neighboring intronic sequence. SQUIRLS recognizes GenBank, RefSeq, UCSC, and simple (1, 2, …, X, Y, MT) contigs accessions.

The region coordinates must be provided using zero-based coordinates where the start position is not part of the region.

By default, SQUIRLS generates all possible SNVs for the bases of the region, including deletion of the base. For example, a region \(r\) spanning ctg1:3-5 of a 10bp-long reference contig ctg1:

>ctg1
ACGTACGTAC

yields the variants:

chrom

pos

SNVs

DELs

INSs

ctg1

4

T>A, T>C, T>G

T>

N/A

ctg1

5

A>C, A>G, A>T

A>

N/A

the annotated variants are stored in a compressed VCF file named squirls-scores.vcf.gz that is by default stored in the current working directory.

Please note that the VCF file not sorted. Please sort and index the VCF file yourself, e.g. by running:

bcftools sort squirls-scores.vcf.gz | bgzip -c > squirls-scores.sorted.vcf.gz
tabix squirls-scores.sorted.vcf.gz

Parameters

There is one required positional parameter that is path to Squirls configuration file. Following that, zero or more region definitions, e.g. CM000669.1:44187000-44187600, CM000669.1:44186000-44186500 can be provided.

Options

There are several options to adjust:

  • -i, --input - path to a BED file with the target regions. Lines starting with # are ignored. See example regions.bed

  • -o, --output - path to VCF file where to write the results. The VCF output is compressed, so we recommend to use *.vcf.gz suffix. (Default: squirls.scores.vcf.gz)

  • -t, --n-threads - number of threads to use for calculating the scores. (Default: 2)

  • --individual - if the flag is present, predictions with respect to all overlapping transcripts will be stored within the INFO field.

  • -l, --length - maximum length of the generated variants on the reference genome, see Variant generation below (Default: 1)

Parallel processing

When predicting the scores, each region is handled by a single thread, while at most -t threads being used for prediction at the same time. Therefore, to fully leverage the parallelism offered by modern multi-core CPUs, we recommend to split large regions into several smaller ones.

Variant generation

The default value of the -l, --length parameter is set to 1. As explained above, the parameter controls the length of the generated variants. However, length can be set to any positive integer, leading to calculation of scores for variants of different lengths.

Using the region \(r\) and the contig ctg1 defined above, setting -l to 2 will calculate scores for variants:

The variant generation pattern

chrom

pos

SNVs

DELs

INSs

ctg1

4

T>A, T>C, T>G

T>, TA>T

T>TA, T>TC, T>TG, T>TT

ctg1

5

A>C, A>G, A>T

A>

A>AA, A>AC, A>AG, A>AT

Note

The number of possible variants grows exponentially with increasing of the --length value. This can lead to substantial run times and to extending your computational budget. Use at your own risk ;)

Result interpretation

When designing Squirls, our motivation was to create an interpretable algorithm for identification of splice deleterious variants. We addressed this goal by limiting ourselves to use a small set of biologically interpretable attributes for learning how to separate splice deleterious variants from neutral polymorphisms.

To help with interpretation of a variants that has been marked as splicing deleterious, we developed HTML result format that presents all available information in a visually attractive way. When reporting variants, we sort the variants by Squirls score in descending order - the most deleterious variants are placed on the top of the list.

The following picture shows an example output for variant NM_000251.2:c.1915C>T (chr2:47702319C>T), predicted to create a novel cryptic donor site (click for a full size image):

cryptic donor site example

Squirls summarizes the information available for the variant in a box. The header of the box contains three fields:

  • Variant coordinates: Summary of variant’s location on used genomic assembly, e.g. chr2:47,702,319 C>T

  • HGVS gene symbol: e.g. MSH2

  • Squirls score: maximum predicted splicing pathogenicity score, e.g. 0.698

The box content consists of three sections:

  • Variant effects on overlapping transcripts: Squirls uses Jannovar to predict effect of the variant on the transcript, and represents the effect using HGVS Sequence Variant Nomenclature. Then, the effects and Squirls scores calculated for the overlapping transcripts are listed in a table. The transcript accessions corresponding to the maximum Squirls score are emphasized by blue color (all 4 rows in the above example).

  • Splicing features: Squirls shows splicing feature values in a table, see Splice features section for explanations.

  • Figures: Squirls presents SVG graphics that show the most important predicted effects.

Variant categories

When generating the HTML report, Squirls makes a decision about which set of figures to make for a given variant. Based on the most likely splice altering-pathomechanism, the variant is assigned into one of four categories that dictate which figures will be generated:

Canonical donor

Squirls creates a Sequence trekker for variants that are likely to disrupt a canonical donor site and to lead to either exon skipping or to utilization of a weaker cryptic site located nearby. In addition, Squirls plots the position of \(\Delta R_i\) canonical donor in the distribution of random changes to sequences of the same length (see \Delta R_i score distribution).

Sequence trekker summarizes the sequence context and the impact of the variant on the binding site. Let’s consider a A>T variant located at position 4 with respect to exon/intron border. The ref allele represented by normal a character is substituted by t. The change location is highlighted by a black box. The unfavourable contact between spliceosome and the alt allele is represented by drawing the red bar corresponding to t upside down, and by drawing box for the a of the ref allele upwards:

canonical donor trekker

The distribution of random changes shows position of \(\Delta R_i\) canonical donor of the particular variant in the distribution of random changes to sequence of the same length. When \(\Delta R_i\) score is positive and close to the distribution edge, then the variant reduces the sequence information and the resulting allele is less likely to be recognized as a donor site.

R_i position in distribution of random changes to sequence of the same length

Cryptic donor

For a variant predicted to create a cryptic donor site, we generate Sequence trekkers to compare the candidate site to the closest canonical donor site.

Let’s consider the case of a missense variant chr2:47,702,319C>T (NM_000251.2: c.1915C>T) reported by Liu et al., 1994 (Table 2, Kindred JV). The variant is located 91 bases upstream of the canonical donor site and introduces a cryptic donor site into coding sequence of the MSH2 gene.

Squirls generates a Sequence trekker for the Canonical donor site, the site features \(R_i=6.10\) bits:

sequence walkers for alt allele of the closest donor site allele

The candidate cryptic donor site consists of the following alleles, where (again) the change C>T is located 91 bases upstream of the canonical donor site:

  • CAGGCATGC (ref)

  • CAGGTATGC (alt)

Squirls represents alleles of the Predicted cryptic donor site by the following sequence trekker:

The T base introduced by the variant increases \(R_i\) of the site by 2.49 bits to \(R_{i\ alt}=8.59\) bits. The increase is graphically represented by drawing an upside-down blue box for c (an unfavorable contact), and upwards pointing box for t to represent a favourable interaction between the alt allele and the spliceosome.

Canonical acceptor

For a variant that is likely to disrupt a canonical acceptor site, we create Sequence trekker, and we plot position of \(\Delta R_i\) canonical acceptor in the distribution of random changes to sequence of the same length (see \Delta R_i score distribution).

Sequence trekker shows relative importance of the individual positions of the acceptor site and the impact of the variant on the site.

canonical acceptor sequence trekker

We also show position of \(\Delta R_i\) canonical acceptor in the distribution of random changes to sequence of the same length. Here, the \(\Delta R_i\) score will be positive if the variant reduces the sequence information and if the variant is likely to reduce recognition of the acceptor site.

R_i position in distribution of random changes to sequence of the same length

Additionally, the variants that introduce (Y)AG sequence into the AG-exclusion zone might lead to exon skipping or to cryptic splicing (see Wimmer et al., 2020). The info regarding violation of the AG-exclusion zone is located in the splice features table.

Cryptic acceptor

For the variant that leads to creation of a cryptic acceptor site, Squirls generates the same graphics as for the cryptic donor sites - two Sequence trekkers to compare the candidate cryptic acceptor site to the closest canonical acceptor site.

Let’s consider the case of the variant chr1:16,451,824C>T (NM_004431.3: c.2826-9G>A) located 9 bases upstream of the canonical acceptor site that introduces a cryptic acceptor site into the EPHA2 gene (Zhang et al., 2009).

The first sequence walker represents the Canonical acceptor site, located 9 bp downstream of the variant site:

cryptic acceptor variant chr1:16,451,824C>T

The alt allele of the canonical site has \(R_i=7.26\) bits.

Then, the Predicted cryptic acceptor site consists of these alleles:

  • ctaactctccctctctccctcccggCC (ref)

  • ctaactctccctctctccctcccagCC (alt)

The corresponding sequence trekker is:

comparison of cryptic acceptor sites for variant chr1:16,451,824C>T

The cryptic acceptor site features \(R_i = 11.98\) bits. Sequence trekker depicts the change by drawing the orange box for g upside down (an unfavorable contact), and by drawing the green box for a upwards (a favourable interaction). The changed position is emphasized by a black box on the sequence ruler.

Note

Please note that Squirls uses the alt allele to generate sequences necessary to draw sequence trekkers for both canonical site and cryptic site. This is because we are interested in comparing the sites and not the individual alleles.

Figure types

This section provides detailed explanations of the figures we generate for the variants, as described in the previous section. We consider these figures to be the most helpful for clinical interpretation of the splice variants.

Sequence ruler

sequence ruler

Sequence rulers are SVG graphics that show the sequence of the donor or acceptor site, mark the intron-exon boundary (red vertical bar), and show the position of any alternate bases that diverge from the reference sequence (black rectangle).

Note

We intentionally omit the position zero in sequence rulers, to make the result interpretation easier for biologists, who are more comfortable with numbering of intronic/exonic bases that starts at one.

However, please note that the correct numbering scheme starts at zero. Please visit website of professor Tom Schneider, where among Pitfalls in Information Theory he also explains the correct numbering scheme for sequences.

Sequence walker

sequence walker

Tom Schneider introduced Sequence walkers in 1995 as a way of graphically displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo (Nucleic Acids Res 1997;25:4408-15).

In 1998, Peter Rogan introduced the application of individual information content and Sequence walkers to splicing variants (Hum Mutat 1998;12:153-71).

Note

Squirls does not generate Sequence walker graphics for sequences. Instead, Squirls uses Sequence trekker, a graphics based on Sequence walker that is explained in the next section.

Sequence trekker

sequence trekker

Squirls combines the sequence ruler, sequence logo, and sequence walker into a new figure that we call Sequence trekker (because a trek goes further than a walk).

On top of that, sequence trekker integrates the information regarding the reference and the alternate alleles into a single graphics.

Sequence trekker replaces the letters used in Sequence walker by bars. The bars are colored using the standard “Sanger” color conventions. Similarly to Sequence walker, the bar orientation indicates favorable (up) or unfavorable (down) contacts. The bar height shows the contribution of that base to the average sequence contribution of the binding site. To present data for reference and alternate alleles in the same time, the bar corresponding to the reference allele at the variant position is drawn with a semi-transparent fill.

In many disease-associated variants, the bar corresponding to the reference base will be positioned upright and the alternate base will be facing down.

\(\Delta R_i\) score distribution

delta R_i score distribution

The individual sequence information of a sequence \(R_{i\ ref}\) and an alternate sequence \(R_{i\ alt}\) are presented using the Sequence trekker. This graphic shows the value of the difference between the reference sequence and an alternate sequence as well as the distribution of random changes to sequences of the same length. A variant that reduces the sequence information is associated with a positive \(\Delta R_i\) score (\(\Delta R_i = 8.96\) bits in this case).

Squirls anatomy

This document outlines the anatomy of the Squirls model, specifically, how a Squirls score is calculated for a variant.

As outlined in the Squirls manuscript, Squirls consists of two random forest estimators (one for the donor and the other for the acceptor site) followed by a logistic regression. Both random forests calculate predictions for a single variant, the predictions are subsequently transformed by the logistic regression into the final Squirls score. For a single variant, Squirls calculates scores for all overlapping transcripts.

Splice features

The first step of the prediction process is the calculation of a small set of interpretable numeric features for machine learning. The features are then passed to random forest estimators. The random forests use different feature subsets to perform the prediction.

Donor site-specific estimator

This section lists the features used by the donor random forest estimator:

\(R_i\) wt donor

Information content (\(R_i\)) of the closest canonical donor site.

\(\Delta R_i\) canonical donor

Difference between \(R_i\) of ref and alt alleles of the closest donor site (0 bits if the variant does not affect the site).

\(\Delta R_i\) wt closest donor

Difference between \(R_i\) of the closest donor and the downstream (3’) donor site (0 bits if this is the donor site of the last intron).

Donor offset

Number of 1 bp-long steps required to pass through the exon/intron border of the closest donor site. The number is negative if the variant is located upstream from the border.

max \(R_i\) cryptic donor window

Maximum \(R_i\) of sliding window of all 9 bp sequences that contain the alt allele.

\(\Delta R_i\) cryptic donor

Difference between max \(R_i\) of sliding window of all 9 bp sequences that contain the alt allele and \(R_i\) of alt allele of the closest donor site.

phyloP

Mean phyloP score of the ref allele region.

Acceptor site-specific estimator

These are the features used by the acceptor random forest estimator:

\(\Delta R_i\) canonical acceptor

Difference between information content (\(R_i\)) of ref and alt alleles of the closest acceptor site (0 if the variant does not affect the acceptor site).

\(\Delta R_i\) cryptic acceptor

Difference between max \(R_i\) of sliding window applied to alt allele neighboring sequence and \(R_i\) of alt allele of the closest acceptor site.

Creates AG in AGEZ

1 if the variant creates a novel AG di-nucleotide in AGEZ, 0 otherwise.

Creates YAG in AGEZ

1 if the variant creates a novel YAG tri-nucleotide in AGEZ where Y stands for a pyrimidine derivative (cytosine or thymine), 0 otherwise (see Wimmer et al., 2020).

Acceptor offset

Number of 1 bp-long steps required to pass through the exon/intron border of the closest acceptor site. The number is negative if the variant is located upstream from the border.

Exon length

Number of nucleotides spanned by the exon where the variant is located in (-1 for non-coding variants that do not affect the canonical donor/acceptor regions).

ESRSeq

Estimate of impact of random hexamer sequences on splicing efficiency when inserted into five distinct positions of two different minigene exons obtained by in vitro screening (Ke et al., 2011).

SMS

Estimated splicing efficiency for 7-mer sequences obtained by saturating a model exon with single and double base substitutions (saturation mutagenesis derived splicing score, Ke et al., 2018).

phyloP

Mean phyloP score of the ref allele region.

Note

The values of all features based on information theory are in bits of information

Random forest estimators

Squirls algorithm consists of two random forest estimators trained to recognize variants that change splicing of a donor or acceptor site. Given a set of splice features, the estimator calculates deleteriousness for the corresponding variant.

If a feature cannot be calculated for a variant, the missing feature value is imputed by a median feature value that was observed during training of the model.

The random forest consists of \(n\) decision trees that use the splice features to make a decision regarding deleteriousness of the variant in question.

Logistic regression

Squirls uses logistic regression as the final step to integrate outputs of the donor and acceptor random forests into the final Squirls score.

Glossary

AGEZ

AG‐exclusion zone, the sequence between the branch point and the proper 3’ss AG that is devoid of AGs, as defined by Gooding et al., 2006

Information content

Individual information content of a nucleotide sequence \(R_i(j)\) that is related to thermodynamic entropy and the free energy of binding. \(R_i\) can also be used to compare sites with one another.

Use Squirls as a library

Squirls has been designed to be used as a library within a larger frameworks for prioritization of genome variants. This document explains how to use Squirls API classes to enable including splicing deleteriousness predictions into a Java software.

The following sections describe how to use Squirls as a module in other Java tool.

Install Squirls modules into your local Maven repository

Please run the following command to install Squirls into the local Maven repository:

git checkout https://github.com/TheJacksonLaboratory/Squirls
cd Squirls
./mvnw install

Note

The installation requires JDK11+ to be present on the platform

Spring Boot application

After the successful installation, add the following dependency into your pom.xml to use Squirls in a Spring boot app:

<dependency>
  <groupId>org.monarchinitiative.squirls</groupId>
  <artifactId>squirls-spring-boot-starter</artifactId>
  <version>1.0.0</version>
</dependency>

After adding the dependency, Spring configures VariantSplicingEvaluator bean, provided that the environment properties

  • squirls.data-directory

  • squirls.genome-assembly

  • squirls.data-version

are set to proper values.

The VariantSplicingEvaluator provides the following methods to calculate SquirlsResult for a Variant specified by Svart library:

SquirlsResult evaluate(Variant variant, Set<String> txIds);
SquirlsResult evaluate(Variant variant);

Please see the corresponding Javadocs to learn more about VariantSplicingEvaluator, SquirlsResult, etc.