# 6 min readAdjustment Methods to Improve Molecular Index Counting in Gene Expression Analysis

Gene expression analyses performed on single cells can offer a powerful method to resolve sample heterogeneity and reveal hidden biology in multicellular tissue samples.

The previous article in this series described a complete workflow of single cell gene expression analysis offered by BD Genomics: isolation of single cells using BD FACS^{TM} Sorter instrument, generation of gene expression libraries using BD^{TM} Precise assays and the subsequent data analysis.

This post provides a more in depth overview of the errors that can arise during library preparation and details new algorithms that adjust for these potential issues.

*To learn more about genomics solutions offered by the company, visit the BD webpage.*

## Overview of BD^{TM} Precise assays

BD^{TM} Precise assays are fast, high-throughput, next-generation sequencing (NGS) library preparation kits most suitable when used with small sample input — such as single cells — to allow for stochastic and unique labelling of mRNAs.

The assays make use of the patented BD^{TM} Molecular Indexing (MI) technology with Sample Index (SI) to label individual mRNA transcripts.

During reverse transcription, the BD^{TM} Precise assays apply a non-depleting pool of 6,561 MI barcodes (or 65,536 barcodes for BD^{TM} Precise whole transcriptome amplification assay) for stochastic and unique labelling of otherwise identical mRNA transcripts to generate unique cDNA.

In addition to MI, 96 SI barcodes are employed to identify the sample origin of each transcript according to the well position in the 96-well BD^{TM} Precise plate.

These unique barcoded primers are then used to label poly-adenylated RNA transcripts in each of the 96 wells, followed by a pooling step into a single tube for a multiplex PCR target enrichment and sequencing adapter attachment (BD^{TM} Precise targeted assay).

*To learn more about BD ^{TM} Precise kits, visit the BD website.*

The resulting library is then ready for NGS using compatible Illumina sequencers and sequencing kits. Each of the sequencing reads are then processed to identify the MI, SI and target gene using a BD primary analysis pipeline (**Figure 1**).

While the use of MIs allows accurate mRNA counting, errors in MIs can occur either during PCR or library prep steps of the protocol.

## Molecular Index counting for high-input samples

As the number of transcripts increases relative to the barcode pool, the percentage of MIs being recycled to label the same gene increases and can be theoretically calculated using the Poisson distribution (**Figure 2A**).

In these situations, without statistical correction, quantifying gene expression using MIs would underestimate the number of molecules that are initially present without any Poisson adjustments.

At extremely high inputs where the number of mRNAs per gene is beyond the entire collection of 6,561 barcodes (or 65,536 for WTA), Poisson correction is no longer possible.

In these situations, regardless of whether there are 65,000 or 100,000 copies of a particular gene in a well, a maximum of 6,561 saturated barcodes is expected in either case.

## Molecular Index errors

When looking at the number of reads for the same MI of a given gene—referred to as **MI depth**—it is possible to detect erroneous base calls or PCR errors generated during library preparation.

For example, a gene that has multiple reads with the same MI and SI represented by multiple reads is likely an accurate measurement compared to an MI for a given gene and an SI that has only a single read.

When a gene has MIs with low and high depth, the low-depth MIs are likely due to errors during the library preparation or sequencing.

These MI errors generally have distinct distributions from the true MIs (**Figure 2B**).

Described below are two sequential algorithms that are employed in the BD^{TM} Precise assays analysis pipeline to remove MI errors.

First, MI errors that manifest as single base substitution errors are identified and adjusted to the true MI barcode using **recursive substitution error correction (RSEC)**.

Subsequently, other MI errors (derived from library preparation steps or sequencing base deletions) are adjusted using **distribution-based error correction (DBEC)**.

*To learn more about genomics solutions offered by the company, visit the BD webpage.*

## Recursive substitution error correction (RSEC)

The RSEC algorithm adjusts MI errors that are derived from PCR and sequencing substitution. These rare erroneous events are observed when examining the MI depth.

For example, the MI depth for MIs that are likely errors are significantly lower than true MIs in adequately sequenced samples (**Figure 3**); in cases where two very similar MIs are used during the initial Molecular Indexing (reverse transcription) steps, they would generally have similar MI depth and do not need to be eliminated.

As sequencing depth increases, more MI errors appear, hence RSEC is crucial for adjusting the MI count for highly sequenced barcoded libraries.

RSEC considers two factors in error correction: 1) similarity in MI sequence and 2) MI depth.

For each target gene, MIs are connected when both of their MI sequence is within one base of each other (Hamming distance = 1).

For each connection between MI x and y, if: **MI Depth(y) > 2*MI Depth(x)+1;y is ‘Parent MI’ and x is ‘Child MI’**

Based on this assignment, child MIs are collapsed to their parent MI.

This process is recursive until there are no more identifiable parent / child MIs for the gene (**Figure 4**).

## MI depth calculations

After RSEC, gene MI counts are evaluated to determine their suitability for further correction.

First, the algorithm identifies whether the MI depth for each gene is sufficient for error correction.

According to the Poisson distribution, if MI depth is less than four, more signal MIs are removed than error MIs using any correction algorithm.

Therefore, genes with low MI depth (< 4 reads per MI) bypass subsequent correction steps for DBEC and are designated **‘low depth’** in the output file.

The algorithm also evaluates gene MIs that appear to be **‘saturated’,** where there are far more input transcripts for stochastic unique MI labelling to occur.

Gene MIs that do not meet either of the two decision points move forward to the DBEC algorithm and are designated as **‘pass’** in the output file.

In addition, genes with higher than an average of 650 MIs per well are designated to be **‘high input**’ as >5% of these MIs are based on the Poisson distribution but run through DBEC (**Figure 2A**).

## Distribution-based error correction (DBEC)

Unlike RSEC, DBEC algorithm is a method to discriminate whether a MI is an error or true signal regardless of its MI sequence.

While RSEC uses both MI sequence and MI depth information to correct for errors, DBEC relies only on MI depth to correct for non-substitution errors.

Error barcodes generally have low MI depth that is distinct from true barcodes MI depth; this difference in MI depth can be observed in a histogram plot of the MI depth (**Figure 2B**).

DBEC fits two negative binomial distributions to statistically distinguish between MI errors with lower MI depth and one for true signal with higher MI depth (**Figure 5**).

## Removal of recycled MIs for optimal distribution fitting

The first step removes potential recycled MIs.

A recycled MI is an MI sequence with the same SI that was used more than once for a given gene prior to amplification.

Although these are two unique molecules that should both be counted, because they have the same MI sequence they are folded into one and treated as though one is the amplicon of the other.

For a given gene, as the number of MIs detected increases, the percentage of recycled MIs increases and can be estimated.

Using the Poisson distribution *(λ _{non-unique})*, the number of recycled MIs for well

*i (n*is estimated from the MI recycling rate equation at right.

_{non-unique}, i)Based on the top MI depth, MIs would be eliminated from distribution fitting—but preserved for later counting steps—to obtain a better negative binomial distribution fitting.

## Estimation of parameters

In order to fit two negative binomial distributions (one for error and one for true signal), two sets of starting values for parameter estimation are approximated.

The error distribution is assumed to be a negative binomial with mean and dispersion of one.

An example of how the two distributions are fitted is shown in **Figure 5**.

If there is an instance where DBEC fails, MIs with a depth of one are removed as a simple error correction step.

## Error / signal probability estimation

After distribution fitting, the signal and error distributions are referred to as NB(μ_{signal}, size_{signal}) and NB(μ_{error}, size_{error}), respectively.

To avoid overcorrection, if the derived cutoff for error distribution is higher than half of the maximum read, only MIs with a depth of one are removed.

*To learn more about solutions offered by the company, visit the BD webpage.*

## Sample experiment data

Below is a visualization of a BD^{TM} Precise targeted assay from a 96-well plate of mixed Jurkat and breast cancer (T47D) single cells (86 genes examined). MI adjustment removes MI noise which allows for clear differentiation of gene expression between cell clusters.

The following heat map displays differential gene expression by MIs between different cell clusters identified in Figure 7 [Left] before any error correction steps (Raw MI) and [Right] after RSEC and DBEC correction (Adjusted MI).

Genes that are low in expression are in blue, and genes that are high in expression are orange. Genes that are similar in gene expression pattern between these cell types are clustered together.

Without error correction, NTC can have noise from high expressing genes such as CD3E and KRT18, which are Jurkat and T47D markers, respectively. Moreover, error correction reveals distinct gene expression patterns between Jurkat and T47D.

*To learn more about products and solutions offered by the company, visit the BD Genomics webpage.*