Title: | Automated Transcriptome Classifier Pipeline: Comprehensive Transcriptome Analysis |
---|---|
Description: | An unsupervised fully-automated pipeline for transcriptome analysis or a supervised option to identify characteristic genes from predefined subclasses. We rely on the 'pamr' <http://www.bioconductor.org/packages//2.7/bioc/html/pamr.html> clustering algorithm to cluster the Data and then draw a heatmap of the clusters with the most significant genes and the least significant genes according to the 'pamr' algorithm. This way we get easy to grasp heatmaps that show us for each cluster which are the clusters most defining genes. |
Authors: | Karam Daka [cre, aut], Dieter Henrik Heiland [aut] |
Maintainer: | Karam Daka <[email protected]> |
License: | GPL-3 |
Version: | 0.1.6 |
Built: | 2024-11-13 04:45:37 UTC |
Source: | https://github.com/cran/AutoPipe |
This function is used to upload a table into R for further use in the AutoPipe
AutoPipe_tSNE(me,perplexity=30,max_iter=500,groups_men)
AutoPipe_tSNE(me,perplexity=30,max_iter=500,groups_men)
me |
The path of the expression table |
perplexity |
numeric; Perplexity parameter |
max_iter |
integer; Number of iterations (default: 1000) |
groups_men |
the data frame with the group clustering that the function Groups_Sup or top_supervised (2. place on the list) returns with the data about each sample and its coressponding cluster. |
this function calls the ConsensusClusterPlus function with thedaraset and plots a plot with the heatmaps of the clustering for each number of clusters from 2 to max_clust
cons_clust(data,max_clust,TOPgenes)
cons_clust(data,max_clust,TOPgenes)
data |
this is the data for the ConsensusClusterPlus |
max_clust |
the max number of clusters that should be evaluated. |
TOPgenes |
the number of the top genes to choose for the clustering |
plots a plot with all the heatmaps from the ConsensusClusterPlus for the number ofd clusters 2 to max_clust the same return value as the COnsensusClusterPlus
data(rna) cons_clust(rna,5,TOPgenes=50)
data(rna) cons_clust(rna,5,TOPgenes=50)
This function clusters the samples into x clusters.
Groups_Sup(me_TOP, me, number_of_k,TRw)
Groups_Sup(me_TOP, me, number_of_k,TRw)
me_TOP |
the matrix with the n top genes, usually the from output of the function TopPAM |
me |
the original expression matrix. (with genes in rows and samples in columns). |
number_of_k |
the number of clusters |
TRw |
threshold for the elemenation of the samples with a Silhouette width lower than TRw. Default value is -1. |
## load data library(org.Hs.eg.db) data(rna) me_x=rna res<-AutoPipe::TopPAM(me_x,max_clusters = 8, TOP=100) me_TOP=res[[1]] number_of_k=res[[3]] File_genes=Groups_Sup(me_TOP, me=me_x, number_of_k,TRw=-1) groups_men=File_genes[[2]] me_x=File_genes[[1]]
## load data library(org.Hs.eg.db) data(rna) me_x=rna res<-AutoPipe::TopPAM(me_x,max_clusters = 8, TOP=100) me_TOP=res[[1]] number_of_k=res[[3]] File_genes=Groups_Sup(me_TOP, me=me_x, number_of_k,TRw=-1) groups_men=File_genes[[2]] me_x=File_genes[[1]]
This function is used to upload a table into R for further use in the AutoPipe
read_expression_file(file, format = "csv", sep=";",gene_name="SYMBOL", Trans=FALSE)
read_expression_file(file, format = "csv", sep=";",gene_name="SYMBOL", Trans=FALSE)
file |
The path of the expression table |
format |
The format of the table "csv" or "txt" |
sep |
The seperator of the input table |
gene_name |
Genes are given in "SYMBOL" or "ENTREZID" |
Trans |
Need Matrix Transpose TRUE or FALSE |
A data.frame with a gene expression matrix
A dataset containing the gene expression data od 48 meningioma tumors
rna
rna
A data frame with 200 rows and 48 variables:
sample BT_1008,
sample BT_1017,
sample BT_1025,
sample BT_1042,
sample BT_1050,
sample BT_1056,
sample BT_1065,
sample BT_1067,
sample BT_1072,
sample BT_1078,
sample BT_1082,
sample BT_1091,
sample BT_1094,
sample BT_1097,
sample BT_1115,
sample BT_605,
sample BT_617,
sample BT_619,
sample BT_633,
sample BT_634,
sample BT_644,
sample BT_654,
sample BT_659,
sample BT_690,
sample BT_695,
sample BT_700,
sample BT_738,
sample BT_751,
sample BT_771,
sample BT_797,
sample BT_803,
sample BT_808,
sample BT_820,
sample BT_837,
sample BT_855,
sample BT_862,
sample BT_873,
sample BT_882,
sample BT_887,
sample BT_900,
sample BT_905,
sample BT_907,
sample BT_920,
sample BT_944,
sample BT_962,
sample BT_963,
sample BT_982,
sample BT_990,
...
This function produces a plot with a Heatmap using a supervised clustering algorithm which the user choses. with a the mean Silhouette width plotted on the right top corner and the Silhouette width for each sample on top. On the right side of the plot the n highest and lowest scoring genes for each cluster will added. And next to them the coressponding pathways (see Details)
Supervised_Cluster_Heatmap(groups_men, gene_matrix, method="PAMR",TOP=1000,TOP_Cluster=150, show_sil=FALSE,show_clin=FALSE,genes_to_print=5, print_genes=FALSE,samples_data=NULL,colors="RdBu", GSE=FALSE,topPaths=5,db="c2",plot_mean_sil=FALSE,stats_clust =NULL,threshold=2)
Supervised_Cluster_Heatmap(groups_men, gene_matrix, method="PAMR",TOP=1000,TOP_Cluster=150, show_sil=FALSE,show_clin=FALSE,genes_to_print=5, print_genes=FALSE,samples_data=NULL,colors="RdBu", GSE=FALSE,topPaths=5,db="c2",plot_mean_sil=FALSE,stats_clust =NULL,threshold=2)
groups_men |
the data frame with the group clustering that the function Groups_Sup or top_supervised (2. place on the list) returns with the data about each sample and its coressponding cluster. |
gene_matrix |
the matrix of n selected genes that the function Groups_Sup returns |
method |
the method to cluster of Clustering. The default is "PAMR" which uses the pamr library. other methods are SAM and our own "EXReg" (see details) |
TOP |
the number of the top genes to take. the default value is 1000. |
TOP_Cluster |
a numeric variable for the number of genes to include in the clusters. Default is 150. |
show_sil |
a logical value that indicates if the function should show the Silhouette width for each sample. Default is FALSE. |
show_clin |
a logical value if TRUE the function will plot the clinical data provided by the user. Default value is FALSE. |
genes_to_print |
the number of genes to print for each cluster. this function adds on the right side. of the heatmap the n highest expressed genes and the n lowest expressed genes for each cluster. Default value is 5. |
print_genes |
a logical value indicating if or not to plot the TOP genes for each cluster.Default value is FALSE. |
samples_data |
the clinical data provided by the user to plot under the heatmap. it will be plotted only if show_clin is TRUE. Default value is NULL. see details for format. |
colors |
the colors for the Heatmap. The function RColorBrewer palletes. |
GSE |
a logical variable that indicates wether to plot thr Gene Set Enrichment Analysis next to the heatmap. Default value is FALSE. |
topPaths |
a numerical value that says how many pathways the Gene Set Enrichment plots should contain fo each cluster. Default value is 5. |
db |
a value for the database for the GSE to be used. Default value is "c1". the paramater can one of the values: "c1","c2","c3",c4","c5","c6","c7","h". See the broad institue GSE GSE webpage for further information in each dataset. |
plot_mean_sil |
A logical value. if TRUE the function plots the mean of the Silhouette width for each cluster number or gap statistic. |
stats_clust |
A vector with the mean Silhouette widths or gap statistic for the number of clusters. The first value should be for 2 Clusters. 2nd is for 3 clusters and so on. |
threshold |
the threshhold for the pam analysis default is 2. |
sample data should be a data.frame with the sample names as rownames and the clinical triats as columns. each trait must be a numeric variable.
##load the org.Hs.eg Library library(org.Hs.eg.db) ## load data data(rna) me_x=rna ## calculate best number of clusters and res<-AutoPipe::TopPAM(me_x,max_clusters = 6, TOP=100) me_TOP=res[[1]] number_of_k=res[[3]] File_genes=Groups_Sup(me_TOP, me=me_x, number_of_k,TRw=-1) groups_men=File_genes[[2]] me_x=File_genes[[1]] o_g<-Supervised_Cluster_Heatmap(groups_men = groups_men, gene_matrix=me_x, method="PAMR",show_sil=TRUE,print_genes=TRUE,threshold=0, TOP = 100,GSE=FALSE,plot_mean_sil=TRUE,stats_clust=res[[2]])
##load the org.Hs.eg Library library(org.Hs.eg.db) ## load data data(rna) me_x=rna ## calculate best number of clusters and res<-AutoPipe::TopPAM(me_x,max_clusters = 6, TOP=100) me_TOP=res[[1]] number_of_k=res[[3]] File_genes=Groups_Sup(me_TOP, me=me_x, number_of_k,TRw=-1) groups_men=File_genes[[2]] me_x=File_genes[[1]] o_g<-Supervised_Cluster_Heatmap(groups_men = groups_men, gene_matrix=me_x, method="PAMR",show_sil=TRUE,print_genes=TRUE,threshold=0, TOP = 100,GSE=FALSE,plot_mean_sil=TRUE,stats_clust=res[[2]])
when perfoming a supervised clustering the user should run this function in order to get the best results.
top_supervised(me,TOP=1000,cluster_which,TRw=-1)
top_supervised(me,TOP=1000,cluster_which,TRw=-1)
me |
the matrix of the gene exporessions, the olums should be the samples and the colnames the sample names the rownames should be the genes . at best the ENTEREZID |
TOP |
the top genes to choose, default is 100. |
cluster_which |
a dataframe with the supervised clustering arrangment of the samples. the dataframe should have the sample names in the first column and the clustering in the secound column. |
TRw |
the threshhold for excluding samples with silhouette width < TRw |
a list. the first place is the expression matrix, the secound is the silhouette for each sample.
library(org.Hs.eg.db) data(rna) cluster_which<-cbind(colnames(rna),c(rep(1,times=24),rep(2,times=24))) me_x=rna ## calculate best number of clusters and res<-top_supervised(me_x,TOP = 100,cluster_which) me_TOP=res[[1]] number_of_k=2 groups_men=res[[2]] me_x=me_TOP colnames(me_x) o_g<-Supervised_Cluster_Heatmap(groups_men = groups_men, gene_matrix=me_x, method="PAMR",show_sil=TRUE,print_genes=TRUE,threshold = 0, TOP = 100,GSE=FALSE,plot_mean_sil=FALSE,stats_clust=res[[2]], samples_data = as.data.frame(groups_men[,1,drop=FALSE]))
library(org.Hs.eg.db) data(rna) cluster_which<-cbind(colnames(rna),c(rep(1,times=24),rep(2,times=24))) me_x=rna ## calculate best number of clusters and res<-top_supervised(me_x,TOP = 100,cluster_which) me_TOP=res[[1]] number_of_k=2 groups_men=res[[2]] me_x=me_TOP colnames(me_x) o_g<-Supervised_Cluster_Heatmap(groups_men = groups_men, gene_matrix=me_x, method="PAMR",show_sil=TRUE,print_genes=TRUE,threshold = 0, TOP = 100,GSE=FALSE,plot_mean_sil=FALSE,stats_clust=res[[2]], samples_data = as.data.frame(groups_men[,1,drop=FALSE]))
This function computes the n=TOP genes and the the best number of clusters
TopPAM(me, max_clusters=15,TOP=1000,B=100,clusterboot=FALSE)
TopPAM(me, max_clusters=15,TOP=1000,B=100,clusterboot=FALSE)
me |
a matrix with genes in rows and samples in columns |
max_clusters |
max. number of clusters to check |
TOP |
the number of genes to take. |
B |
integer, number of Monte Carlo (“bootstrap”) samples. |
clusterboot |
A logical value indicating wether or not to calculate the Gap statistic and to bootstrap. |
we use the clusGap algorithm from the package cluster to calculate the Gap statistic.
a list of 1. A matrix with the top genes 2. A list of means of the Silhouette width for each number of clusters. 3. The optimal number of clusters. 4. gap_st the gap statistic of the clustering 5. best number of clusters according to the gap statistic.
##load the org.Hs.eg Library library(org.Hs.eg.db) #' ## load data data(rna) me_x=rna res<-AutoPipe::TopPAM(me_x,max_clusters = 8, TOP=100,clusterboot=FALSE) me_TOP=res[[1]] number_of_k=res[[3]]
##load the org.Hs.eg Library library(org.Hs.eg.db) #' ## load data data(rna) me_x=rna res<-AutoPipe::TopPAM(me_x,max_clusters = 8, TOP=100,clusterboot=FALSE) me_TOP=res[[1]] number_of_k=res[[3]]
A function for unsupervised Clustering of the data
UnSuperClassifier(data,clinical_data=NULL,thr=2,TOP_Cluster=150,TOP=100)
UnSuperClassifier(data,clinical_data=NULL,thr=2,TOP_Cluster=150,TOP=100)
data |
the data for the clustering. Data should be in the following format: samples in columns and the genes in the rows (colnames and rownames accordingly). The rownames should be Entrez ID in order to plot a gene set enrichment analysis. |
clinical_data |
the clinical data provided by the user to plot under the heatmap. it will be plotted only if show_clin is TRUE. Default value is NULL. see details for format. |
thr |
The threshold for the PAMR algorithm default is 2. |
TOP_Cluster |
numeric; Number of genes in each cluster. |
TOP |
numeric; the number of the TOP genes to take from the gene exoression matrix see TopPAM TOP. |
sample data should be a data.frame with the sample names as rownames and the clinical triats as columns. each trait must be a numeric variable. @return the function is an autated Pipeline for clustering it plot cluster analysis for the geneset