Frequently Asked Questions

I am having problems with Dask

An alternative is to use the multiprocessing implementation of Arboreto included in pySCENIC (arboreto_with_multiprocessing.py). This scrips is also available on the path once pySCENIC is installed. This script uses the Arboreto and pySCENIC codebase to run GRNBoost2 (or GENIE3) without Dask. The eliminates the possibility of running the GRN step across multiple nodes, but brings provides additional stability. The run time is generally equivalent to the Dask implementation using the same number of workers.

The basic usage is:

arboreto_with_multiprocessing.py \
    expr_mat.loom \
    allTFs_hg38.txt \
    --method grnboost2 \
    --output adj.tsv \
    --num_workers 20 \
    --seed 777

The algorithm can be selected using the --method option (genie3 or grnboost2). Possible input formats for the expression data are the same as for the pySCENIC CLI: loom, and csv.

How can I prioritize the target genes within a particular regulon?

It can be useful to have a better idea of which particular target genes within a regulon are more important, especially in the cases of regulons with many genes. There are multiple possibilities to address this.

  1. iRegulon analysis. One possibility is to take the pre-refinement modules for the regulon of interest, and export them for analysis in iRegulon. There are six unrefined modules created from the GRN output for each TF of interest, which are generated using multiple approaches. The use of iRegulon for the pruning/refinement allows for more control over the pruning of these modules and possibly a better idea of which genes are more important. See the last section, Further exploration of modules directly from the network inference output, in this notebook for an example of how to get started. A full tutorial on module refinement with iRegulon can be found here.
  2. pySCENIC multi-runs. Another approach is to run the whole pySCENIC procedure multiple times (~10-100x). Because of the stochastic nature of the GRN step in particular, a slightly varying result is produced for each run, both in terms of the overall regulons found, as well as the target genes for a particular regulon. The target genes (and regulons themselves) can then be scored by the number of times it occurs across all runs, and considered as ‘high confidence’ if they occur in >80% of runs, for example. This has the potential to be very computationally intensive for a whole dataset, but if only a few regulons are of interest, pySCENIC can be run using only these (by limiting the TFs included in the TFs file that goes into the GRN step). The multi-runs capability is implemented in the SCENIC section of our single cell Nextflow pipeline. The entrypoint scenic_multiruns provides an automated way to run this procedure.

How can I create a SCope-compatible loom file?

pySCENIC is capable of reading and writing loom files. However, the loom file created in the AUCell step has the pySCENIC results embedded, but no dimensionality reductions are included. In order to create a loom file that is ready to be uploaded to the SCope viewer, we can use a helper script from VSN Pipelines. These need to be downloaded:

wget https://raw.githubusercontent.com/vib-singlecell-nf/vsn-pipelines/master/src/scenic/bin/add_visualization.py
wget https://raw.githubusercontent.com/vib-singlecell-nf/vsn-pipelines/master/src/scenic/bin/export_to_loom.py

The add_visualization.py script will take as input the loom file created by pyscenic aucell and add a basic UMAP and t-SNE based on the SCENIC AUCell matrix. Some additional packages are required for this, in particular MulticoreTSNE and umap. The usage is as follows:

python add_visualization.py \
    --loom_input auc_mtx.loom \
    --loom_output scenic_visualize.loom \
    --num_workers 8

The output loom file is then ready for use in SCope.

This can also be easily run using the Docker image, which already contains all of the necessary packages (it’s still necessary to download the above python files, however):

docker run -it --rm -v $PWD:$PWD -w $PWD aertslab/pyscenic:0.12.1 \
    python add_visualization.py \
        --loom_input auc_mtx.loom \
        --loom_output scenic_visualize.loom \
        --num_workers 8

Can I create my own ranking databases?

Yes you can. See create_cisTarget_databases for more detailed instructions to create custom cisTarget databases.

Can I draw the distribution of AUC values for a regulon across cells?

import pandas as pd
import matplotlib.pyplot as plt


def plot_binarization(auc_mtx: pd.DataFrame, regulon_name: str, threshold: float, bins: int=200, ax=None) -> None:
    """
    Plot the "binarization" process for the given regulon.

    :param auc_mtx: The dataframe with the AUC values for all cells and regulons (n_cells x n_regulons).
    :param regulon_name: The name of the regulon.
    :param bins: The number of bins to use in the AUC histogram.
    :param threshold: The threshold to use for binarization.
    """
    if ax is None:
        ax=plt.gca()
    auc_mtx[regulon_name].hist(bins=bins,ax=ax)

    ylim = ax.get_ylim()
    ax.plot([threshold]*2, ylim, 'r:')
    ax.set_ylim(ylim)
    ax.set_xlabel('AUC')
    ax.set_ylabel('#')
    ax.set_title(regulon_name)