ACDC in Python

acdc_py

acdc_py.GS

acdc_py.GS(adata, res_vector=array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]), NN_vector=array([11, 21, 31, 41, 51, 61, 71, 81, 91, 101]), dist_slot=None, use_reduction=True, reduction_slot='X_pca', metrics='sil_mean', opt_metric='sil_mean', opt_metric_dir='max', cluster_labels=None, cluster_name=None, seed=0, key_added='clusters', approx_size=None, verbose=True, show_progress_bar=True, batch_size=1000, njobs=1)[source]

A tool for the optimization-based unsupervised clustering of large-scale data. Grid Search (GS) allows for deterministic optimization of several variables (the number of nearest neighbors and the resolution) with several objective functions (e.g. Silhouette Score). An approximation method we call subsampling and diffusion is included to allow fast and accurate clustering of hundreds of thousands of cells.

Parameters:
  • adata – An anndata object containing a gene expression signature in adata.X and gene expression counts in adata.raw.X.

  • res_vector (default: np.arange(0.1, 2, 0.2)) – sequence of values of the resolution parameter.

  • NN_vector (default: np.arange(11, 102, 10)) – sequence of values for the number of nearest neighbors.

  • dist_slot (default: None) – Slot in adata.obsp where a pre-generated distance matrix computed across all cells is stored in adata for use in construction of NN. (Default = None, i.e. distance matrix will be automatically computed as a correlation distance and stored in “corr_dist”).

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for clustering.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • metrics (default: "sil_mean") – A metric or a list of metrics to be computed at each iteration of the GridSearch. Possible metrics to use include “sil_mean”, “sil_mean_median”, “tot_sil_neg”, “lowest_sil_clust”, “max_sil_clust”, “ch” and “db”.

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • cluster_labels (default: None) – A column in adata.obs with a set of cluster labels containing a cluster to subcluster. Specify the cluster with the cluster_name parameter.

  • cluster_name (default: None) – A cluster from cluster_labels to subcluster. When None, cluster whole dataset.

  • seed (default: 0) – Random seed to use.

  • key_added (default: "clusters") – Slot in obs to store the resulting clusters.

  • approx_size (default: None) – When set to a positive integer, instead of running GS on the entire dataset, perform GS on a subsample and diffuse those results. This will lead to an approximation of the optimal solution for cases where the dataset is too large to perform GS on due to time or memory constraints.

  • verbose (default: True) – Include additional output with True. Alternative = False.

  • show_progress_bar (default: True) – Show a progress bar to visualize the progress of the algorithm.

  • batch_size (default: 1000) – The size of each batch. Larger batches result in more memory usage. If None, use the whole dataset instead of batches.

  • njobs (default: 1) – Parallelization option that allows users to speed up runtime.

Returns:

  • An object of class anndata.AnnData containing a clustering vector “clusters” in the .obs slot and a dictionary “GS_results_dict” with information on the run in the .uns slot.
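As a rough illustration of the search space, the default grids from the signature can be rebuilt with NumPy. The sketch assumes the grid is searched exhaustively, so the defaults give 100 candidate (resolution, nearest-neighbor) pairs. The name grid is illustrative and not part of acdc_py.

```python
import itertools

import numpy as np

# Default grids, written as in the parameter descriptions above.
res_vector = np.arange(0.1, 2, 0.2)   # 0.1, 0.3, ..., 1.9
NN_vector = np.arange(11, 102, 10)    # 11, 21, ..., 101

# Assumption: each (resolution, NN) pair is one candidate clustering.
grid = list(itertools.product(res_vector, NN_vector))
print(len(res_vector), len(NN_vector), len(grid))
```

With the package installed, the same vectors would be passed as acdc_py.GS(adata, res_vector=res_vector, NN_vector=NN_vector).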

acdc_py.SA

acdc_py.SA(adata, res_range=[0.1, 1.9], NN_range=[11, 101], dist_slot=None, use_reduction=True, reduction_slot='X_pca', metrics='sil_mean', opt_metric='sil_mean', opt_metric_dir='max', cluster_labels=None, cluster_name=None, maxiter=20, initial_temp=5230, restart_temp_ratio=2e-05, visit=2.62, accept=-5.0, maxfun=10000000.0, seed=0, key_added='clusters', approx_size=None, verbose=True, show_progress_bar=True, batch_size=1000, njobs=1)[source]

A tool for the optimization-based unsupervised clustering of large-scale data. Simulated Annealing (SA) allows for stochastic optimization of several variables (the number of nearest neighbors and the resolution) with several objective functions (e.g. Silhouette Score). An approximation method we call subsampling and diffusion is included to allow fast and accurate clustering of hundreds of thousands of cells.

Parameters:
  • adata – An anndata object containing a gene expression signature in adata.X and gene expression counts in adata.raw.X.

  • res_range (default: [0.1, 1.9]) – edge values of the search space for the resolution parameter.

  • NN_range (default: [11, 101]) – edge values of the search space for the nearest neighbors parameter.

  • dist_slot (default: None) – Slot in adata.obsp where a pre-generated distance matrix computed across all cells is stored in adata for use in construction of NN. (Default = None, i.e. distance matrix will be automatically computed as a correlation distance and stored in “corr_dist”).

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for clustering.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • metrics (default: "sil_mean") – A metric or a list of metrics to be computed at each iteration of the search. Possible metrics to use include “sil_mean”, “sil_mean_median”, “tot_sil_neg”, “lowest_sil_clust”, “max_sil_clust”, “ch” and “db”.

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • cluster_labels (default: None) – A column in adata.obs with a set of cluster labels containing a cluster to subcluster. Specify the cluster with the cluster_name parameter.

  • cluster_name (default: None) – A cluster from cluster_labels to subcluster. When None, cluster whole dataset.

  • maxiter (default: 20) – The maximum number of global search iterations. If None, value is 1000.

  • minimizer_kwargs (dict, optional) – Extra keyword arguments to be passed to the local minimizer (minimize). Some important options could be: method for the minimizer method to use and args for objective function additional arguments.

  • initial_temp (float, optional) – The initial temperature. Higher values facilitate a wider search of the energy landscape, allowing dual_annealing to escape local minima that it is trapped in. Default value is 5230. Range is (0.01, 5.e4].

  • restart_temp_ratio (float, optional) – During the annealing process the temperature decreases; when it reaches initial_temp * restart_temp_ratio, the reannealing process is triggered. Default value of the ratio is 2e-5. Range is (0, 1).

  • visit (float, optional) – Parameter for visiting distribution. Default value is 2.62. Higher values give the visiting distribution a heavier tail, which makes the algorithm jump to a more distant region. The value range is (1, 3].

  • accept (float, optional) – Parameter for acceptance distribution. It is used to control the probability of acceptance. The lower the acceptance parameter, the smaller the probability of acceptance. Default value is -5.0 with a range (-1e4, -5].

  • maxfun (int, optional) – Soft limit for the number of objective function calls. If the algorithm is in the middle of a local search when this number is reached, the limit will be exceeded and the algorithm will stop just after the local search is done. Default value is 1e7.

  • seed (default: 0) – Random seed to use.

  • key_added (default: "clusters") – Slot in obs to store the resulting clusters.

  • approx_size (default: None) – When set to a positive integer, instead of running SA on the entire dataset, perform SA on a subsample and diffuse those results. This will lead to an approximation of the optimal solution for cases where the dataset is too large to perform SA on due to time or memory constraints.

  • verbose (default: True) – Include additional output with True. Alternative = False.

  • show_progress_bar (default: True) – Show a progress bar to visualize the progress of the algorithm.

  • batch_size (default: 1000) – The size of each batch. Larger batches result in more memory usage. If None, use the whole dataset instead of batches.

  • njobs (default: 1) – Parallelization option that allows users to speed up runtime.

Returns:

  • An object of class anndata.AnnData containing a clustering vector “clusters” in the .obs slot and a dictionary “SA_results_dict” with information on the run in the .uns slot.
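The annealing parameters interact through simple arithmetic. The sketch below uses the defaults from the signature to compute the temperature at which reannealing is triggered, and checks a set of values against the ranges quoted in the parameter list. The helper in_documented_ranges is hypothetical and not part of acdc_py.

```python
# Defaults taken from the SA signature above.
initial_temp = 5230
restart_temp_ratio = 2e-05
visit = 2.62
accept = -5.0

# Temperature at which the reannealing process is triggered.
restart_temp = initial_temp * restart_temp_ratio


def in_documented_ranges(initial_temp, restart_temp_ratio, visit, accept):
    """Check values against the ranges quoted in the parameter list."""
    return (
        0.01 < initial_temp <= 5.0e4
        and 0 < restart_temp_ratio < 1
        and 1 < visit <= 3
        and -1e4 < accept <= -5
    )


print(restart_temp)
print(in_documented_ranges(initial_temp, restart_temp_ratio, visit, accept))
```

With the defaults, reannealing starts once the temperature has dropped to roughly 0.1, about five orders of magnitude below the initial temperature.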

acdc_py.get_opt

acdc_py.get_opt.SA_clustering(adata, dist_slot=None, use_reduction=True, reduction_slot='X_pca', opt_metric='sil_mean', opt_metric_dir='max', cluster_labels=None, cluster_name=None, n_clusts=None, seed=0, approx_size=None, key_added='clusters', knn_slot='knn', verbose=True, njobs=1)[source]

Get the clustering using the parameters found by optimizing a particular metric using the results produced by the SA function. Note that this requires running the acdc.SA function first in order to produce adata.uns[‘SA_results_dict’].

Parameters:
  • adata – An anndata object containing a distance object in adata.obsp.

  • dist_slot (default: None) – Slot in adata.obsp where a pre-generated distance matrix computed across all cells is stored in adata for use in construction of NN. (Default = None, i.e. distance matrix will be automatically computed as a correlation distance and stored in “corr_dist”).

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for clustering.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • cluster_labels (default: None) – A column in adata.obs with a set of cluster labels containing a cluster to subcluster. Specify the cluster with the cluster_name parameter.

  • cluster_name (default: None) – A cluster from cluster_labels to subcluster. When None, cluster whole dataset.

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to compute the optimal clustering solution with this many clusters.

  • seed (default: 0) – Random seed to use.

  • approx_size (default: None) – When set to a positive integer, instead of running GS on the entire dataset, perform GS on a subsample and diffuse those results. This will lead to an approximation of the optimal solution for cases where the dataset is too large to perform GS on due to time or memory constraints.

  • key_added (default: "clusters") – Slot in obs to store the resulting clusters.

  • knn_slot (default: "knn") – Slot in uns that stores the KNN array used to compute a neighbors graph (i.e. adata.obsp[‘connectivities’]).

  • verbose (default: True) – Include additional output with True. Alternative = False.

  • njobs (default: 1) – Parallelization option that allows users to speed up runtime.

Returns:

  • Adds fields to the input adata, such that it contains the clustering stored in adata.obs[key_added].

acdc_py.get_opt.SA_params(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal parameters for clustering found by optimizing a particular metric using the results produced by the SA function. Note that this requires running the acdc.SA function first in order to produce adata.uns[‘SA_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.SA function in adata.uns[‘SA_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • A dictionary with keys opt_res and opt_knn and the corresponding values that produce the requested clustering solution.
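The selection logic described by opt_metric, opt_metric_dir and n_clusts can be sketched in plain Python. This is only an illustration of the idea, not the package's implementation: the function pick_optimal, the toy table, and the column names "resolution" and "knn" are assumptions, and the metric values are made up.

```python
def pick_optimal(search_rows, opt_metric="sil_mean", opt_metric_dir="max",
                 n_clusts=None):
    """Return the row of a search table that optimizes opt_metric.

    search_rows is a list of dicts standing in for the search results;
    the column names used here are assumptions.
    """
    rows = search_rows
    if n_clusts is not None:
        # Restrict the search space to solutions with exactly n_clusts clusters.
        rows = [r for r in rows if r["n_clust"] == n_clusts]
    if not rows:
        raise ValueError("no solution with the requested number of clusters")
    chooser = max if opt_metric_dir == "max" else min
    return chooser(rows, key=lambda r: r[opt_metric])


# Toy search results (values are made up for illustration).
search_rows = [
    {"resolution": 0.1, "knn": 11, "n_clust": 2, "sil_mean": 0.41},
    {"resolution": 0.5, "knn": 31, "n_clust": 4, "sil_mean": 0.55},
    {"resolution": 0.9, "knn": 31, "n_clust": 6, "sil_mean": 0.48},
    {"resolution": 1.3, "knn": 51, "n_clust": 6, "sil_mean": 0.37},
]

best = pick_optimal(search_rows)
params = {"opt_res": best["resolution"], "opt_knn": best["knn"]}
best_six = pick_optimal(search_rows, n_clusts=6)
print(params, best_six["sil_mean"])
```

Setting n_clusts filters the table before the optimum is taken, so the result is the best solution among those with that many clusters rather than the global best.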

acdc_py.get_opt.SA_metric_value(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal value for a particular metric found when using parameters for clustering that optimize said metric. This will be identified using the results produced by the SA function. Note that this therefore requires running acdc.SA first in order to produce adata.uns[‘SA_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.SA function in adata.uns[‘SA_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • The value of the metric when optimized.

acdc_py.get_opt.SA_metric_search_data(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal parameters for clustering along with all their associated statistics. These will be found by optimizing a particular metric using the results produced by the SA function. Note that this requires running the acdc.SA function first in order to produce adata.uns[‘SA_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.SA function in adata.uns[‘SA_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • A pandas Series containing the resolution and knn that produce the requested clustering solution along with all other metrics.

acdc_py.get_opt.GS_clustering(adata, dist_slot=None, use_reduction=True, reduction_slot='X_pca', opt_metric='sil_mean', opt_metric_dir='max', cluster_labels=None, cluster_name=None, n_clusts=None, seed=0, approx_size=None, key_added='clusters', knn_slot='knn', verbose=True, njobs=1)[source]

Get the clustering using the parameters found by optimizing a particular metric using the results produced by the GS function. Note that this requires running the acdc.GS function first in order to produce adata.uns[‘GS_results_dict’].

Parameters:
  • adata – An anndata object containing a distance object in adata.obsp.

  • dist_slot (default: None) – Slot in adata.obsp where a pre-generated distance matrix computed across all cells is stored in adata for use in construction of NN. (Default = None, i.e. distance matrix will be automatically computed as a correlation distance and stored in “corr_dist”).

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for clustering.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • cluster_labels (default: None) – A column in adata.obs with a set of cluster labels containing a cluster to subcluster. Specify the cluster with the cluster_name parameter.

  • cluster_name (default: None) – A cluster from cluster_labels to subcluster. When None, cluster whole dataset.

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to compute the optimal clustering solution with this many clusters.

  • seed (default: 0) – Random seed to use.

  • approx_size (default: None) – When set to a positive integer, instead of running GS on the entire dataset, perform GS on a subsample and diffuse those results. This will lead to an approximation of the optimal solution for cases where the dataset is too large to perform GS on due to time or memory constraints.

  • key_added (default: "clusters") – Slot in obs to store the resulting clusters.

  • knn_slot (default: "knn") – Slot in uns that stores the KNN array used to compute a neighbors graph (i.e. adata.obsp[‘connectivities’]).

  • verbose (default: True) – Include additional output with True. Alternative = False.

  • njobs (default: 1) – Parallelization option that allows users to speed up runtime.

Returns:

  • Adds fields to the input adata, such that it contains the clustering stored in adata.obs[key_added].

acdc_py.get_opt.GS_params(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal parameters for clustering found by optimizing a particular metric using the results produced by the GS function. Note that this requires running the acdc.GS function first in order to produce adata.uns[‘GS_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.GS function in adata.uns[‘GS_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • A dictionary with keys opt_res and opt_knn and the corresponding values that produce the requested clustering solution.

acdc_py.get_opt.GS_metric_value(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal value for a particular metric found when using parameters for clustering that optimize said metric. This will be identified using the results produced by the GS function. Note that this therefore requires running acdc.GS first in order to produce adata.uns[‘GS_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.GS function in adata.uns[‘GS_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • The value of the metric when optimized.

acdc_py.get_opt.GS_metric_search_data(adata, opt_metric='sil_mean', opt_metric_dir='max', n_clusts=None)[source]

Get the optimal parameters for clustering along with all their associated statistics. These will be found by optimizing a particular metric using the results produced by the GS function. Note that this requires running the acdc.GS function first in order to produce adata.uns[‘GS_results_dict’].

Parameters:
  • adata – An anndata object containing the results of the acdc.GS function in adata.uns[‘GS_results_dict’].

  • opt_metric (default: "sil_mean") – A metric from metrics to use to optimize parameters for the clustering.

  • opt_metric_dir (default: "max") – Whether opt_metric is more optimal by maximizing (“max”) or by minimizing (“min”).

  • n_clusts (default: None) – If not None, restrict the search space to the number of clusters equal to n_clusts in order to retrieve the parameters for optimal clustering with this many clusters.

Returns:

  • A pandas Series containing the resolution and knn that produce the requested clustering solution along with all other metrics.
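Taken together, the GS and get_opt functions suggest a workflow along the following lines. This is a sketch built only from the signatures in this reference: the wrapper run_gs_workflow is hypothetical, acdc_py must be installed for it to run, and adata is assumed to already hold the reduction named by reduction_slot. The reference does not say whether GS and cluster_final modify adata in place or return a new object, so the sketch handles both cases.

```python
def run_gs_workflow(adata, key_added="clusters"):
    """Run Grid Search, read back the optimum, and replicate the clustering."""
    # Imported lazily so the sketch can be loaded without the package.
    import acdc_py as acdc

    # Assumption: GS may either update adata in place or return an object.
    result = acdc.GS(adata, metrics="sil_mean", opt_metric="sil_mean",
                     opt_metric_dir="max", key_added=key_added)
    if result is not None:
        adata = result

    # Requires adata.uns["GS_results_dict"], produced by the GS call above.
    params = acdc.get_opt.GS_params(adata, opt_metric="sil_mean",
                                    opt_metric_dir="max")

    # Replicate the final clustering with the reported optimum.
    final = acdc.tl.cluster_final(adata, res=params["opt_res"],
                                  knn=params["opt_knn"], key_added=key_added)
    return (final if final is not None else adata), params
```

The same pattern would apply to Simulated Annealing by calling acdc.SA and acdc.get_opt.SA_params instead.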

acdc_py.pp

acdc_py.pp.corr_distance(adata, use_reduction=True, reduction_slot='X_pca', key_added='corr_dist', batch_size=1000, dtype=<class 'numpy.int16'>, verbose=True)[source]

A tool for computing a distance matrix based on Pearson correlation.

Parameters:
  • adata – An anndata object containing a signature in adata.X

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for computing distance.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • key_added (default: "corr_dist") – Slot in obsp to store the resulting distance matrix.

  • batch_size (default: 1000) – Reduce total memory usage by running data in batches.

  • dtype (default: np.int16) – Data type used to represent the distance values. np.int16 (default) is a compromise: it keeps memory usage low without reducing information so much as to affect clustering. Supported dtypes include np.int8, np.int16 (default), np.int32, np.int64, np.float16, np.float32, and np.float64.

  • verbose (default: True) – Show a progress bar for each batch of data.

Returns:

  • Adds fields to the input adata, such that it contains a distance matrix stored in adata.obsp[key_added].
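For orientation, a correlation distance is commonly defined as 1 minus the Pearson correlation between two cells. The NumPy sketch below uses that definition on a made-up matrix. It is an assumption that acdc_py uses the same definition, and the sketch stays in floating point because the reference does not describe how values are encoded for integer dtypes.

```python
import numpy as np


def correlation_distance(X):
    """Pairwise distance between rows of X, defined as 1 - Pearson r."""
    # np.corrcoef treats each row as one variable, so rows are cells here.
    return 1.0 - np.corrcoef(X)


# Toy matrix: 4 cells x 5 features (values are made up).
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with cell 0
    [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with cell 0
    [1.0, 3.0, 2.0, 5.0, 4.0],
])
D = correlation_distance(X)
print(np.round(D, 3))
```

Under this definition distances range from 0 (perfectly correlated) to 2 (perfectly anti-correlated), and the matrix is symmetric with a zero diagonal.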

acdc_py.pp.neighbors_knn(adata, max_knn=101, dist_slot='corr_dist', key_added='knn', batch_size=1000, verbose=True, njobs=1)[source]

A tool for computing a KNN array used to then rapidly generate connectivity graphs with acdc.pp.neighbors_graph for clustering.

Parameters:
  • adata – An anndata object containing a distance object in adata.obsp.

  • max_knn (default: 101) – The maximum number of k-nearest neighbors (knn) to include in this array. acdc.pp.neighbors_graph will only be able to compute KNN graphs with knn <= max_knn.

  • dist_slot (default: "corr_dist") – The slot in adata.obsp where the distance object is stored. One way of generating this object is with acdc.pp.corr_distance.

  • key_added (default: "knn") – Slot in uns to store the resulting knn array.

  • batch_size (default: 1000) – Size of the batches used to reduce memory usage.

  • verbose (default: True) – Whether to display a progress bar of the batches completed.

  • njobs (default: 1) – Parallelization option that allows users to speed up runtime.

Returns:

  • Adds fields to the input adata, such that it contains a knn array stored in adata.uns[key_added].
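The idea of a reusable KNN array can be sketched with NumPy: sort each row of the distance matrix once, keep the max_knn closest cells, and later read off any smaller neighborhood from the leading columns. The function knn_array is illustrative, the distances are made up, and excluding a cell from its own neighbor list is an assumption rather than documented behavior.

```python
import numpy as np


def knn_array(dist, max_knn):
    """Indices of the max_knn nearest neighbors of each row, nearest first."""
    dist = np.array(dist, dtype=float)   # copy so the input is left untouched
    np.fill_diagonal(dist, np.inf)       # assumption: a cell is not its own neighbor
    order = np.argsort(dist, axis=1, kind="stable")
    return order[:, :max_knn]


# Toy symmetric distance matrix for 4 cells (values are made up).
dist = np.array([
    [0.0, 0.1, 0.7, 0.9],
    [0.1, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])
knn = knn_array(dist, max_knn=3)
print(knn)

# A graph with fewer neighbors only needs the leading columns.
print(knn[:, :2])
```

Because the columns are ordered by distance, a single array computed with a large max_knn can serve every smaller neighborhood size tried during the search.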

acdc_py.pp.neighbors_graph(adata, n_neighbors=15, knn_slot='knn', batch_size=1000, verbose=True)[source]

A tool for rapidly computing a k-nearest neighbor (knn) graph (i.e. connectivities) that can then be used for clustering.

Parameters:
  • adata – An anndata object containing a distance object in adata.obsp.

  • n_neighbors (default: 15) – The number of nearest neighbors to use to build the connectivity graph. This number must be less than the total number of knn in the knn array stored in adata.uns[knn_slot].

  • knn_slot (default: "knn") – The slot in adata.uns where the knn array is stored. One way of generating this object is with acdc.pp.neighbors_knn.

  • batch_size (default: 1000) – Size of the batches used to reduce memory usage.

  • verbose (default: True) – Whether to display a progress bar of the batches completed.

Returns:

  • Adds fields to the input adata, such that it contains a knn graph stored in adata.obsp[‘connectivities’] along with metadata in adata.uns[“neighbors”].
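To show how a graph can be read off a precomputed KNN array, the sketch below builds a binary, symmetrized adjacency matrix from the first n_neighbors columns. This is a simplification: the connectivities computed by acdc_py may be weighted, and the function knn_adjacency and the toy array are illustrative only.

```python
import numpy as np


def knn_adjacency(knn, n_neighbors):
    """Binary, symmetrized adjacency matrix from the first n_neighbors columns."""
    n_cells, max_knn = knn.shape
    if n_neighbors > max_knn:
        raise ValueError("n_neighbors exceeds the neighbors stored in the array")
    adj = np.zeros((n_cells, n_cells), dtype=int)
    rows = np.repeat(np.arange(n_cells), n_neighbors)
    adj[rows, knn[:, :n_neighbors].ravel()] = 1
    # Connect i and j if either one lists the other as a neighbor.
    return np.maximum(adj, adj.T)


# Toy KNN array: row i lists the neighbors of cell i, nearest first.
knn = np.array([[1, 2, 3], [0, 2, 3], [3, 1, 0], [2, 1, 0]])
adj = knn_adjacency(knn, n_neighbors=1)
print(adj)
```

No distances are recomputed here, which is why building graphs for many values of n_neighbors from one stored array is fast.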

acdc_py.pl

acdc_py.pl.GS_search_space(adata, plot_type='sil_mean')[source]

Get a heatmap of the search space traversed by Grid Search (GS).

Parameters:
  • adata – An anndata object that was previously given to GS

  • plot_type (default: "sil_mean") – A column name in adata.uns[“GS_results_dict”][“search_df”]. Among others, options include “sil_mean” and “n_clust”.

Returns:

  • A matplotlib.figure.Figure object containing the plot.
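A heatmap of the search space amounts to arranging one column of the search results on a grid of the two searched parameters. The sketch below does that pivot in plain Python. The function pivot_search, the column names "resolution" and "knn", and the values are assumptions, and which parameter the real plot puts on which axis is not stated in this reference.

```python
def pivot_search(rows, value="sil_mean"):
    """Arrange search results as a knn-by-resolution grid of one column."""
    res_values = sorted({r["resolution"] for r in rows})
    knn_values = sorted({r["knn"] for r in rows})
    lookup = {(r["knn"], r["resolution"]): r[value] for r in rows}
    # Combinations missing from the results are left as None.
    grid = [[lookup.get((k, res)) for res in res_values] for k in knn_values]
    return knn_values, res_values, grid


# Toy search results (values are made up for illustration).
rows = [
    {"resolution": 0.1, "knn": 11, "sil_mean": 0.30, "n_clust": 2},
    {"resolution": 0.3, "knn": 11, "sil_mean": 0.42, "n_clust": 3},
    {"resolution": 0.1, "knn": 21, "sil_mean": 0.35, "n_clust": 2},
    {"resolution": 0.3, "knn": 21, "sil_mean": 0.40, "n_clust": 3},
]
knn_values, res_values, grid = pivot_search(rows)
for k, row in zip(knn_values, grid):
    print(k, row)
```

Passing value="n_clust" instead shows how the number of clusters varies over the same grid.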

acdc_py.pl.SA_search_space(adata, plot_type='sil_mean', plot_density=True)[source]

Get a dot plot of the search space traversed by Simulated Annealing (SA).

Parameters:
  • adata – An anndata object that was previously given to SA

  • plot_type (default: "sil_mean") – A column name in adata.uns[“SA_results_dict”][“search_df”]. Among others, options include “sil_mean” and “n_clust”.

  • plot_density (default: True) – Whether to plot density on the dotplot to identify regions that were highly traversed by SA.

Returns:

  • A matplotlib.figure.Figure object containing the plot.

acdc_py.pl.metric_vs_n_clusts(adata, metric='sil_mean', width=5, height=5, xlabel='number of clusters', ylabel=None, axis_fontsize=14)[source]

Plot the values of a metric against the number of clusters for the solutions in the search results.

Parameters:
  • adata – An anndata object that was previously given to GS

  • metric (default: "sil_mean") – A column name in adata.uns[“GS_results_dict”][“search_df”]. Among others, options include “sil_mean”.

  • width (default: 5) – Figure width (inches)

  • height (default: 5) – Figure height (inches)

  • xlabel (default: 'number of clusters') – x-axis label

  • ylabel (default: None) – When None, ylabel will be metric.

  • axis_fontsize (default: 14) – Fontsize for xlabel and ylabel.

acdc_py.pl.silhouette_scores(adata, groupby, dist_slot, palette=None, ylab=None, show=True)[source]

Plot the silhouette scores for the clustering in adata.obs[groupby], calculated from the distance object in adata.obsp[dist_slot].

Parameters:
  • adata – An anndata object.

  • groupby – A name of the column in adata.obs that contains the clustering that you want to calculate silhouette scores for.

  • dist_slot – The slot in adata.obsp where the distance object that will be used to calculate the silhouette score is stored.

  • palette (default: None) – The name of a Matplotlib qualitative colormap. If None, use ACDC default palette.

  • ylab (default: None) – The label to put on the y-axis.

  • show (default: True) – Whether to show the plot.
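The quantity being plotted can be sketched from its standard definition: for each cell, a is the mean distance to the other cells of its own cluster, b is the lowest mean distance to any other cluster, and the silhouette score is (b - a) / max(a, b). The function below is a plain-Python illustration on a made-up distance matrix. It is not the package's implementation, and how acdc_py aggregates or weights the scores is not covered here.

```python
def silhouette_per_cell(dist, labels):
    """Silhouette score of each cell from a precomputed distance matrix."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, own in enumerate(labels):
        # Mean distance from cell i to the members of each cluster (excluding i).
        mean_to = {}
        for c in clusters:
            members = [j for j, lab in enumerate(labels) if lab == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        if own not in mean_to:
            scores.append(0.0)   # singleton cluster: score defined as 0
            continue
        a = mean_to[own]
        b = min(v for c, v in mean_to.items() if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy distance matrix for 4 cells in 2 clusters (values are made up).
dist = [
    [0.0, 0.1, 0.7, 0.9],
    [0.1, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
labels = ["0", "0", "1", "1"]
scores = silhouette_per_cell(dist, labels)
sil_mean = sum(scores) / len(scores)
print([round(s, 3) for s in scores], round(sil_mean, 3))
```

Scores near 1 indicate cells that sit well inside their cluster, while negative scores indicate cells that are closer on average to another cluster.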

acdc_py.tl

acdc_py.tl.cluster_final(adata, res, knn, dist_slot=None, use_reduction=True, reduction_slot='X_pca', seed=0, approx_size=None, key_added='clusters', knn_slot='knn', verbose=True, batch_size=1000, njobs=1)[source]

A tool for replicating the final optimization-based unsupervised clustering of large-scale data performed by the Grid Search (GS) or Simulated Annealing (SA) functions.

Parameters:
  • adata – An anndata object containing a gene expression signature in adata.X and gene expression counts in adata.raw.X.

  • res – Value of the resolution parameter to use for the clustering.

  • knn – Number of nearest neighbors to use for the clustering.

  • dist_slot (default: None) – Slot in adata.obsp where a pre-generated distance matrix computed across all cells is stored in adata for use in construction of NN. (Default = None, i.e. distance matrix will be automatically computed as a correlation distance and stored in “corr_dist”).

  • use_reduction (default: True) – Whether to use a reduction (True) (highly recommended - accurate & much faster) or to use the direct matrix (False) for clustering.

  • reduction_slot (default: "X_pca") – If use_reduction is True, the slot that holds the reduction to use.

  • seed (default: 0) – Random seed to use.

  • key_added (default: "clusters") – Slot in obs to store the resulting clusters.

  • knn_slot (default: "knn") – Slot in uns that stores the KNN array used to compute a neighbors graph (i.e. adata.obsp[‘connectivities’]).

  • approx_size (default: None) – When set to a positive integer, instead of running GS on the entire dataset, perform GS on a subsample and diffuse those results. This will lead to an approximation of the optimal solution for cases where the dataset is too large to perform GS on due to time or memory constraints.

  • batch_size (default: 1000) – The size of each batch. Larger batches result in more memory usage. If None, use the whole dataset instead of batches.

  • verbose (default: True) – Include additional output with True. Alternative = False.

Returns:

  • An object of class anndata.AnnData containing a clustering vector “clusters” in the .obs slot.

acdc_py.tl.extract(adata, groupby, clusters)[source]

Extract clusters as a new AnnData object. Useful for subclustering.

Parameters:
  • adata – An anndata object containing a gene expression signature in adata.X and gene expression counts in adata.raw.X.

  • groupby – A name of the column in adata.obs.

  • clusters – Names of clusters in adata.obs[groupby] to extract.

acdc_py.tl.merge(adata, groupby, clusters, merged_name=None, update_numbers=True, key_added='clusters', return_as_series=False)[source]

Merge clusters together and, if desired, renumber the clusters based on cluster size.

Parameters:
  • adata – An anndata object containing a gene expression signature in adata.X and gene expression counts in adata.raw.X.

  • groupby – A name of the column in adata.obs.

  • clusters – Names of clusters in adata.obs[groupby] to merge.

  • merged_name (default: None) – The name of the new cluster. If None with digit clusters, the new cluster will be named after the smallest of the merged. If None with non-digit clusters, the new cluster will be named by joining the names of the clusters.

  • update_numbers (default: True) – If clusters are digits, renumber the clusters based on cluster size.

  • key_added (default: "clusters") – Store the new clustering in adata.obs[key_added].

  • return_as_series (default: False) – Rather than storing the clusters, return them as a pd.Series object.
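The naming rule for merged_name can be sketched in plain Python. Two points are assumptions: "the smallest of the merged" is read here as the smallest label number, and non-digit names are joined with an underscore, since the reference does not state the separator. Both helper functions are hypothetical, and renumbering by cluster size is left out.

```python
def merged_cluster_name(clusters, merged_name=None, sep="_"):
    """Name for a merged cluster, following the merged_name rule above.

    sep is a hypothetical choice; the reference does not say how
    non-digit names are joined.
    """
    if merged_name is not None:
        return merged_name
    names = [str(c) for c in clusters]
    if all(name.isdigit() for name in names):
        # Digit clusters: take the smallest of the merged labels.
        return str(min(int(name) for name in names))
    return sep.join(names)


def merge_labels(labels, clusters, merged_name=None):
    """Replace every label listed in clusters with the merged cluster name."""
    new_name = merged_cluster_name(clusters, merged_name)
    targets = {str(c) for c in clusters}
    return [new_name if str(lab) in targets else str(lab) for lab in labels]


print(merge_labels(["0", "1", "2", "1"], ["1", "2"]))
print(merge_labels(["T", "NK", "B"], ["T", "NK"]))
```

Passing merged_name explicitly bypasses both rules and is the safer choice when the resulting label matters downstream.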

acdc_py.tl.rename(adata, groupby, name_dict)[source]

Rename clusters within adata.obs[groupby] using name_dict to specify the mapping between old and new names.
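The mapping in name_dict goes from old names to new names. A minimal plain-Python sketch of that idea follows. The function rename_labels is illustrative, and leaving unmapped names unchanged is an assumption about the behavior.

```python
def rename_labels(labels, name_dict):
    """Map old cluster names to new ones; unmapped names are kept as-is."""
    return [name_dict.get(lab, lab) for lab in labels]


labels = ["0", "1", "0", "2"]
renamed = rename_labels(labels, {"0": "T cells", "1": "B cells"})
print(renamed)
```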

acdc_py.config

acdc_py.config.set_SS_bootstraps(n_subsamples=1, subsamples_pct_cells=100)[source]

Parameters:
  • n_subsamples (default: 1) – Number of subsamples per bootstrap.

  • subsamples_pct_cells (default: 100) – Percentage of cells sampled at each bootstrap iteration (i.e. when 100, all cells are used).

acdc_py.config.set_SS_weights(SS_weights='unitary', SS_exp_base=2.718282)[source]

Parameters:
  • SS_weights (default: “unitary”) – Negative silhouette scores can be given more weight by exponentiation (“exp”). Otherwise, leave SS_weights as “unitary”.

  • SS_exp_base (default: 2.718282) – If SS_weights is set to “exp”, the base to use for exponentiation.

acdc_py.config.set_clust_alg(clust_alg='Leiden')[source]

Parameters:
  • clust_alg (default: “Leiden”) – Clustering algorithm. Choose between “Leiden” (default) and “Louvain”.

acdc_py.config.set_corr_distance_dtype(dtype=<class 'numpy.int16'>)[source]

Parameters:
  • dtype (default: np.int16) – Data type used to represent the distance values. np.int16 (default) is a compromise: it keeps memory usage low without reducing information so much as to affect clustering. Supported dtypes include np.int8, np.int16 (default), np.int32, np.int64, np.float16, np.float32, and np.float64.
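The memory side of this trade-off is easy to estimate. The sketch below assumes the distance matrix is stored densely as a cells-by-cells array, which the reference does not state explicitly. The function dense_matrix_gib is illustrative only.

```python
import numpy as np


def dense_matrix_gib(n_cells, dtype):
    """Memory needed for a dense n_cells x n_cells matrix, in GiB."""
    return n_cells ** 2 * np.dtype(dtype).itemsize / 2 ** 30


# Compare a few of the supported dtypes for 100,000 cells.
for dtype in (np.int8, np.int16, np.float32, np.float64):
    print(np.dtype(dtype).name, round(dense_matrix_gib(100_000, dtype), 1))
```

Under that assumption, np.int16 needs a quarter of the memory of np.float64 for the same number of cells.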