Classifying cells with Garnett

So far we have:

imported a CDS object
corrected for batch effects
put our data into lower dimensional space
clustered and visualized our cells in UMAP space

Today we will learn how to:

use Garnett to classify cells with a pre-trained classifier and by training our own classifier
visualize cells with cell labels

We can run clustering algorithms on our cells to identify “cell types”, that is, groups of cells that are similar to each other based on their shared gene expression. Without manually inspecting which genes are unique to each cluster, it is difficult to assign cell types, and manually assigning clusters to known cell types based on canonical gene expression is a very laborious process. Luckily, the Trapnell lab has developed the software package Garnett.

Citation:
Pliner, H.A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat Methods 16, 983–986 (2019). https://doi.org/10.1038/s41592-019-0535-3

Garnett is a software package that facilitates automated cell type classification from single-cell gene expression data. Garnett works by taking single-cell data, along with a cell type definition (marker) file, and training a regression-based classifier. Once a classifier is trained for a tissue/sample type, it can be applied to classify future datasets from similar tissues. In addition to describing the training and classifying functions, the Garnett website aims to be a repository of previously trained classifiers.

There are two options when you are starting: download a pre-trained classifier, or train your own.

[Figure: garnett. Image source: https://cole-trapnell-lab.github.io/garnett/classifiers/]

# do not treat warnings as errors when installing packages from GitHub
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")

# load the necessary packages
library(monocle3)

# install garnett
devtools::install_github("cole-trapnell-lab/garnett", ref="monocle3")
library(garnett)

# install Mouse gene annotation
BiocManager::install(c('org.Mm.eg.db'))
library("org.Mm.eg.db")

# install ggplot2 (BiocManager::install() can also install CRAN packages)
BiocManager::install(c('ggplot2'))

# use install.packages("package_name") if a package is not already installed
library(tidyverse)  # attaches ggplot2 and tidyr, among others
library(viridis)
library(ggridges)
library(RColorBrewer)
library(ggrepel)
library(pheatmap)

The Garnett workflow has two major parts, each described in detail below:

Train/obtain the classifier: Download an existing classifier, or train your own. To train, Garnett parses a marker file, chooses a set of training cells, and then trains a multinomial classifier to distinguish cell types.

Classify cells: Next, Garnett applies the classifier to a set of cells to generate cell type assignments. Optionally, Garnett will expand classifications to similar cells to generate a separate set of cluster-extended type assignments.

First, let’s download the “mmlung” marker file (https://cole-trapnell-lab.github.io/garnett/classifiers/). Save this marker file to your “scRNAseq_project” folder. Open up the marker file and take a look at it.

The marker file contains a list of cell type definitions written in an easy-to-read text format. The cell type definitions tell Garnett how to choose cells to train the model on. Each cell type definition starts with a ‘>’ symbol and the cell type name, followed by a series of lines with definition information. Definition lines start with a keyword and a ‘:’, and entries are separated by commas. Note: the Garnett syntax allows entries following the ‘:’ to continue onto following lines; however, you may not move to a new line mid-entry (i.e., you can go to a new line only after a comma).

There are several ways to define cell types in the Garnett marker file format. In general, each cell type’s definition can have three major components. Only the first component is required. The first and most important specification for a cell type is its expression. Garnett offers several options for specifying marker genes, detailed below.

Format:
expressed: gene1, gene2
not expressed: gene1, gene2

Example:
expressed: MYOD1, MYH3
not expressed: PAX6, PAX3

In addition to expression information, you can further refine your cell type definitions using meta data. This is also where you will specify any subtypes you expect in your data.

subtype of: allows you to specify that a cell type is a subtype of another cell type in your definition file.

custom meta data: this specification allows you to provide any further meta data requirements for your cell type. Any column in the pData table of your CDS object can be used as a meta data specification. In the more complex example below, there would be a column in the pData table called “sample”.

Lastly, we highly recommend that you document how you chose your marker definitions. To make it easier to keep track of, we provide an additional specification - references: - that will store your citation information for each cell type.

A more complex example:

>B cells
expressed: CD19, MS4A1
expressed above: CD79A 10
references: https://www.abcam.com/primary-antibodies/b-cells-basic-immunophenotyping, 10.3109/07420528.2013.775654

>T cells
expressed: CD3D
sample: blood # A meta data specification

>Helper T cells
expressed: CD4
subtype of: T cells
references: https://www.ncbi.nlm.nih.gov/pubmed/?term=12000723
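
To make the layout of a marker file concrete, here is a minimal base R sketch that reads definitions in the format described above into a nested list. The helper `parse_marker_lines()` is hypothetical and written only for illustration; it is not Garnett's parser, and it does not handle comments or entries that continue onto a new line.

```r
# Toy marker text in the format described above (gene names are taken
# from the examples in this lesson)
marker_text <- c(
  ">B cells",
  "expressed: CD19, MS4A1",
  "references: 10.3109/07420528.2013.775654",
  "",
  ">T cells",
  "expressed: CD3D",
  "sample: blood",
  "",
  ">Helper T cells",
  "expressed: CD4",
  "subtype of: T cells"
)

# Hypothetical helper for illustration only (NOT part of Garnett)
parse_marker_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop blank lines
  defs <- list()
  current <- NULL
  for (ln in lines) {
    if (startsWith(ln, ">")) {
      # a '>' line opens a new cell type definition
      current <- trimws(substring(ln, 2))
      defs[[current]] <- list()
    } else {
      # definition line: "keyword: entry1, entry2"
      keyword <- trimws(sub(":.*$", "", ln))
      entries <- trimws(strsplit(sub("^[^:]*:", "", ln), ",")[[1]])
      defs[[current]][[keyword]] <- entries
    }
  }
  defs
}

marker_defs <- parse_marker_lines(marker_text)
names(marker_defs)                                 # cell type names
marker_defs[["B cells"]][["expressed"]]            # marker genes for B cells
marker_defs[["Helper T cells"]][["subtype of"]]    # parent cell type
```

Each cell type becomes one list element, and each keyword (expressed, subtype of, references, or a meta data column such as sample) becomes a named character vector inside it.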

Classify our cells by training our own classifier with a marker file

# import cds we saved at the end of lesson 2
cds <- readRDS("<filepath>/<filename>.rds")

# define marker file path
marker_file_path <- "/Users/elizabarkan/Desktop/mmLung_markers.txt"

# check our markers
marker_check <- check_markers(cds, marker_file_path,
                              db=org.Mm.eg.db,
                              cds_gene_id_type = "SYMBOL", # in cds: read gene symbol or ensembl id
                              marker_file_gene_id_type = "SYMBOL") # in marker file: read gene symbol or ensembl id

# check ambiguity scores for markers (lower ambiguity scores are more desirable)
plot_markers(marker_check)
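
One way to act on the plot is to list the markers whose ambiguity exceeds some cutoff and revisit them in the marker file. The sketch below uses a made-up data frame standing in for `marker_check`. The column names `marker_gene`, `cell_type`, and `ambiguity`, the gene names, the scores, and the 0.5 cutoff are all assumptions for illustration, so check `colnames(marker_check)` on your own output before adapting it.

```r
# Toy stand-in for the check_markers() output; all values are invented
toy_check <- data.frame(
  marker_gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
  cell_type   = c("Type 1", "Type 1", "Type 2", "Type 2"),
  ambiguity   = c(0.02, 0.71, 0.10, NA),
  stringsAsFactors = FALSE
)

ambiguity_cutoff <- 0.5  # assumed threshold, adjust for your data

# markers to revisit: ambiguity above the cutoff (rows with a missing score are skipped)
flagged <- subset(toy_check, !is.na(ambiguity) & ambiguity > ambiguity_cutoff)
flagged$marker_gene
```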

If some markers have high ambiguity scores, consider correcting the marker file (for example, by removing or replacing those markers) before training.

# train a classifier with our marker file
our_lung_classifier <- train_cell_classifier(cds = cds,
                                             marker_file = marker_file_path,
                                             db = org.Mm.eg.db,
                                             cds_gene_id_type = "SYMBOL",
                                             num_unknown = 50,
                                             marker_file_gene_id_type = "SYMBOL")

# run the classifier on the cells in our cds
cds_manual <- classify_cells(cds, our_lung_classifier,
                             db = org.Mm.eg.db,
                             cluster_extend = TRUE,
                             cds_gene_id_type = "SYMBOL")

# look at cell classifications
head(pData(cds_manual))

# table of how many cells are classified into each cell type
table(pData(cds_manual)$cell_type)

# table of how many cells are classified into each cluster-extended type
table(pData(cds_manual)$cluster_ext_type)

Next, let’s visualize cells colored by cell type

# visualize cells by cell type
plot_cells(cds_manual, color_cells_by = "cell_type") + theme_bw()

[Figure: manual_cell_type]

# visualize cells by cluster-extended type
plot_cells(cds_manual, color_cells_by = "cluster_ext_type") + theme_bw()

[Figure: manual_ext_type]

Classify cells with a pre-trained classifier

Alternatively, instead of training our own classifier, we could have used a classifier that has already been trained. Luckily, in our case a classifier trained on cells of the same species, age, and tissue exists, but this is not always the case. If you create a new marker file and classifier, submit them to help grow the marker file and classifier database (https://cole-trapnell-lab.github.io/garnett/docs/#submitting-a-classifier)!

Let’s download the “mmlung” classifier (https://cole-trapnell-lab.github.io/garnett/classifiers/). Save this classifier to your “scRNAseq_project” folder.

lung_classifier <- readRDS("<filepath>/mmLung_classifier.RDS")

Let’s classify the cells with the lung classifier we imported

cds_auto <- classify_cells(cds, lung_classifier,
                           db=org.Mm.eg.db,
                           cluster_extend = TRUE,
                           cds_gene_id_type = "SYMBOL")
head(pData(cds_auto))
table(pData(cds_auto)$cell_type)
table(pData(cds_auto)$cluster_ext_type)
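
Garnett labels cells it cannot assign as "Unknown", so a quick way to summarize these tables is to compare the fraction of unlabeled cells before and after cluster extension. The sketch below uses invented label vectors in place of `pData(cds_auto)$cell_type` and `pData(cds_auto)$cluster_ext_type`; on real data you would substitute those columns.

```r
# Toy labels standing in for the two classification columns (values invented)
cell_type        <- c("B cells", "Unknown", "T cells", "Unknown", "T cells")
cluster_ext_type <- c("B cells", "B cells", "T cells", "Unknown", "T cells")

# fraction of cells left unlabeled under each scheme
frac_unknown <- c(
  cell_type        = mean(cell_type == "Unknown"),
  cluster_ext_type = mean(cluster_ext_type == "Unknown")
)
frac_unknown
```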

Next, let’s visualize cells colored by cell type

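# visualize cells by cell type
plot_cells(cds_auto, color_cells_by = "cell_type") + theme_bw()

[Figure: auto_cell_type]

# visualize cells by cluster-extended type
plot_cells(cds_auto, color_cells_by = "cluster_ext_type") + theme_bw()

[Figure: auto_ext_type]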

Visualize Different Annotations of the Same Cells with Different Annotation Methods

Say we have two methods of annotating cells and we want to compare them, specifically to identify which cell types are commonly labeled as one cell type by one method and as another cell type by the other. For example, both classifiers may do a great job classifying one cell type, while another cell type is more ambiguous, so the classifiers perform differently. We can create a “confusion matrix” to visualize these differences.

Code provided by Sanjay Srivatsan

# append cell annotations to the original cds
cds$manual_cell_type = pData(cds_manual)$cell_type
cds$manual_cluster_ext_type = pData(cds_manual)$cluster_ext_type

cds$auto_cell_type = pData(cds_auto)$cell_type
cds$auto_cluster_ext_type = pData(cds_auto)$cluster_ext_type

# Make a matrix to compare the manual classifier vs. the pre-trained classifier (auto)
matrix_for_heatmap =
  colData(cds) %>%
  as.data.frame() %>%
  # count the cells in each manual cell type
  group_by(manual_cluster_ext_type) %>%
  add_tally(name = "num_manual_cell_type") %>%
  ungroup() %>%
  # fraction of each manual cell type that received each auto label
  group_by(manual_cluster_ext_type, auto_cluster_ext_type) %>%
  mutate(percent_nn_in_garnett = n()/num_manual_cell_type) %>%
  dplyr::select(auto_cluster_ext_type, manual_cluster_ext_type, percent_nn_in_garnett) %>%
  distinct() %>%
  # one row per manual type, one column per auto type
  spread(key = auto_cluster_ext_type, value = percent_nn_in_garnett, fill = 0)

# create matrix for heatmap
matrix_for_heatmap =
  matrix_for_heatmap %>%
  tibble::column_to_rownames(var = "manual_cluster_ext_type") %>%
  as.matrix()

# save heatmap as pdf
pheatmap(matrix_for_heatmap,
         cellwidth = 15,
         cellheight = 15,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         fontsize_row = 10,
         fontsize_col = 10,
         legend = TRUE,
         filename = "<filepath>/<filename>.pdf",
         color = viridis(option = "viridis", n = 40))

[Figure: heatmap]

Check that the heatmap PDF was saved to your specified file path and that it looks like the heatmap above. Take a few minutes to examine the heatmap and make a note of what you observe, specifically which cell types are classified differently by the two methods.
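
To see what the values in the heatmap mean, the same kind of row-normalized confusion matrix can be sketched in base R on a handful of cells. The label vectors below are invented stand-ins for `cds$manual_cluster_ext_type` and `cds$auto_cluster_ext_type`; this is an illustration of the idea, not a replacement for the pipeline above.

```r
# Toy labels for six cells (values invented)
manual <- c("B cells", "B cells", "T cells", "T cells", "T cells", "Unknown")
auto   <- c("B cells", "Unknown", "T cells", "T cells", "B cells", "Unknown")

# count each (manual, auto) pair, then divide every row by its total
# so that each row sums to 1
counts    <- table(manual = manual, auto = auto)
confusion <- prop.table(counts, margin = 1)
round(confusion, 2)
```

Each row is a manual label and each column is an auto label, so a row with most of its weight on the diagonal is a cell type the two classifiers agree on, while a row spread across several columns marks a cell type they label differently.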

Save CDS

We will be using this CDS with cell type labels for the next class

saveRDS(cds, "<filepath>/cds_class3.RDS")

Wrapping Up

Today we learned how to:

use Garnett to classify cells with a pre-trained classifier and by training our own classifier
visualize cells with cell labels