Package 'TSGS'

Title: Trait Specific Gene Selection using Support Vector Machine and Genetic Algorithm
Description: Identifying a relevant set of trait-specific genes from gene expression data is important for the clinical diagnosis of disease and for discovering disease mechanisms in plants and animals. The task involves identifying relevant genes and removing as many redundant genes as possible from the whole gene set. This package returns a trait-specific gene set from high-dimensional RNA-seq count data by combining two conventional machine learning algorithms, the support vector machine (SVM) and the genetic algorithm (GA). The GA controls and optimizes the subset of genes passed to the SVM for classification and evaluation. It uses repeated learning steps and cross-validation over a number of candidate solutions and selects the best one. Gene subsets are scored by a fitness function derived from the SVM. With the SVM serving as the classifier and the GA performing feature selection, a trait-specific gene set is obtained.
Authors: Md. Samir Farooqi [aut], K.K. Chaturvedi [aut], D.C. Mishra [aut], Sudhir Srivastava [cre, aut]
Maintainer: Sudhir Srivastava <[email protected]>
License: GPL-2 | GPL-3
Version: 1.0
Built: 2025-03-10 02:33:18 UTC
Source: https://github.com/sudhirsrivastava/tsgs

Help Index


Trait specific gene selection using support vector machine and genetic algorithm

Description

This function returns the optimal set of informative genes selected from RNA-seq count data.
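
The general idea of a genetic algorithm driving gene-subset selection can be sketched in base R as follows. This is an illustrative sketch only and not the package's implementation: the data are simulated, a leave-one-out nearest-centroid classifier stands in for the SVM fitness function, and every name used here (fitness, pop_size, n_iter, the mutation rate of 0.02) is an assumption made for the example.

```r
set.seed(42)
n_genes <- 30
n_samples <- 40
y <- rep(c(0, 1), each = n_samples / 2)
# Simulated expression matrix (genes x samples); genes 1-3 carry the signal
X <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
X[1:3, y == 1] <- X[1:3, y == 1] + 3

# Stand-in fitness: leave-one-out accuracy of a nearest-centroid classifier
# restricted to the genes flagged TRUE in `mask` (the package uses an SVM)
fitness <- function(mask) {
  if (!any(mask)) return(0)
  Z <- X[mask, , drop = FALSE]
  correct <- 0
  for (i in seq_len(ncol(Z))) {
    train <- setdiff(seq_len(ncol(Z)), i)
    c0 <- rowMeans(Z[, train[y[train] == 0], drop = FALSE])
    c1 <- rowMeans(Z[, train[y[train] == 1], drop = FALSE])
    pred <- as.numeric(sum((Z[, i] - c1)^2) < sum((Z[, i] - c0)^2))
    correct <- correct + (pred == y[i])
  }
  correct / ncol(Z)
}

pop_size <- 20  # plays the role of a population size
n_iter <- 5     # plays the role of a number of iterations
# Each row of `pop` is one candidate gene subset encoded as a logical mask
pop <- matrix(runif(pop_size * n_genes) < 0.2, nrow = pop_size)
for (iter in seq_len(n_iter)) {
  fit <- apply(pop, 1, fitness)
  # Keep the fitter half of the population as parents (elitism)
  parents <- pop[order(fit, decreasing = TRUE)[1:(pop_size / 2)], , drop = FALSE]
  children <- parents
  for (k in seq_len(nrow(children))) {
    mates <- parents[sample(nrow(parents), 2), , drop = FALSE]
    cut <- sample(n_genes - 1, 1)                              # one-point crossover
    child <- c(mates[1, 1:cut], mates[2, (cut + 1):n_genes])
    flip <- runif(n_genes) < 0.02                              # bit-flip mutation
    child[flip] <- !child[flip]
    children[k, ] <- child
  }
  pop <- rbind(parents, children)
}
final_fit <- apply(pop, 1, fitness)
best <- pop[which.max(final_fit), ]
selected <- which(best)  # indices of the genes in the best subset found
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population cannot decrease from one iteration to the next.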

Usage

featureSelect(X, y, p = 20, n.iter = 5, alpha = 0.05, p.adj.method = "bonferroni")

Arguments

X

X is a G x N data frame of gene expression values (raw count data), where rows represent genes and columns represent samples. Each cell holds the read count of a gene in a sample. The row names of X should be gene names or gene IDs.

y

y is a numeric vector of length N with entries 0 or 1 giving the sample labels for the two conditions, e.g., 0 for Control and 1 for Case.

p

Population size of the genetic algorithm, by default 20

n.iter

The number of iterations, by default 5

alpha

The level of significance, by default 0.05

p.adj.method

Method of adjusting p-values, by default "bonferroni". The other methods available are "BH", "holm", "hochberg", "hommel", "BY".
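
The method names listed for p.adj.method match those accepted by the base R function p.adjust. The sketch below shows how adjusted p-values compare against a significance level such as alpha. The raw p-values are made up for illustration, and how featureSelect applies the adjustment internally is not documented here, so treat this only as a demonstration of the adjustment itself.

```r
# Hypothetical raw p-values for four genes
p_raw <- c(0.001, 0.012, 0.03, 0.2)
alpha <- 0.05

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
# Benjamini-Hochberg controls the false discovery rate and is less strict
p_bh <- p.adjust(p_raw, method = "BH")

# Genes that remain significant at level alpha under each method
sig_bonf <- p_bonf < alpha
sig_bh <- p_bh < alpha
```

On these values Bonferroni keeps the first two genes, while BH also keeps the third, which illustrates why the choice of method can change the size of the reported gene set.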

Value

InformativeGenes

List of informative genes selected

LogCPM

Log counts-per-million (log-CPM) data for the informative genes

DEA_Result

Differential expression analysis results for the informative genes
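
For readers unfamiliar with log-CPM values, the following base R sketch shows one common way to derive them from raw counts. The counts are made up, and the pseudo-count of 1 is an assumption chosen for the example. The exact formula used to produce LogCPM in this package is not stated in this help page and may differ.

```r
# Hypothetical raw counts: 3 genes x 2 samples
counts <- matrix(c(10, 0, 90,
                   20, 30, 50),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# Library size = total reads per sample
lib_size <- colSums(counts)

# Counts per million: scale each column by its library size
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Log2 transform with a pseudo-count of 1 so that zero counts stay finite
log_cpm <- log2(cpm + 1)
```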

Author(s)

Md. Samir Farooqi <[email protected]> [aut], K.K. Chaturvedi <[email protected]> [aut], D.C. Mishra <[email protected]> [aut], Sudhir Srivastava <[email protected]> [cre, aut]

Examples

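# Load the example count data shipped with the package
filename <- system.file("extdata", "exampleData.csv", package = "TSGS")
cdata <- read.csv(filename, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# The first row holds the 0/1 sample labels; the remaining rows are gene counts
X <- as.data.frame(cdata[-1, ])
y <- as.numeric(cdata[1, ])
set.seed(100)  # fix the random seed so the run is reproducible
result <- featureSelect(X, y, p = 20, n.iter = 2, alpha = 0.05, p.adj.method = "bonferroni")
# Extract the three components of the result
gene_list <- result$InformativeGenes
logcpm_data <- result$LogCPM
dea_result <- result$DEA_Result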
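
Before calling featureSelect on your own data, it can help to confirm that X and y have the shapes described under Arguments. The helper below, check_inputs, is hypothetical and not part of the TSGS package. It is a minimal sketch of such checks, run here on made-up toy data.

```r
# Hypothetical helper (not part of TSGS): basic sanity checks on X and y
check_inputs <- function(X, y) {
  stopifnot(
    is.data.frame(X),                # X must be a data frame (genes x samples)
    all(as.matrix(X) >= 0),          # raw counts cannot be negative
    is.numeric(y),
    ncol(X) == length(y),            # one label per sample (column of X)
    all(y %in% c(0, 1)),             # labels are coded 0/1
    length(unique(y)) == 2           # both conditions must be present
  )
  invisible(TRUE)
}

# Toy data: 6 genes x 4 samples of simulated counts
set.seed(1)
X_toy <- as.data.frame(matrix(rpois(24, lambda = 50), nrow = 6, ncol = 4))
rownames(X_toy) <- paste0("gene", 1:6)
colnames(X_toy) <- paste0("sample", 1:4)
y_toy <- c(0, 0, 1, 1)

ok <- check_inputs(X_toy, y_toy)
```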