Documentation — molAOP Analyser

Getting Started

New to the tool? Take the guided tour — it walks you through every step of an analysis using a demo dataset.

Input File Format Specification

File Format

Accepted formats: CSV, TSV, or TXT
Maximum file size: 10 MB
Must be tabular data with column headers

Required Columns

Your data file must contain at least three columns (column names don't matter — the tool auto-detects them):

Gene Identifiers: HGNC symbols (e.g., TP53), Ensembl IDs (e.g., ENSG00000141510), or Entrez IDs
Fold Change: log₂ fold change values from differential expression analysis
P-values: Statistical significance values (raw or adjusted p-values)

Example Data Format

Gene_Symbol  log2FoldChange  pvalue   padj
TP53         2.34            0.0001   0.005
BRCA1        -1.56           0.023    0.089
EGFR         0.89            0.045    0.12
CYP3A4       3.21            0.00001  0.0001
...

Note: Gene identifiers should be consistent throughout your file. Mixing identifier types (e.g., some symbols, some Ensembl IDs) may reduce matching efficiency.

Data Preprocessing Tips

Remove duplicate gene entries or aggregate them before upload
Ensure p-values are between 0 and 1
log₂FC values can be any real number (typically between -10 and +10)
Rows with a missing value (NA, NaN) in the gene ID column are skipped
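
The tips above can be applied before upload with a short script. The following is a minimal sketch using only the Python standard library; the function name `clean_rows` and the rule of keeping the row with the smallest p-value for a duplicated gene are assumptions for illustration, not the tool's own behaviour.

```python
import csv
import io
import math

def clean_rows(rows, gene_col, fc_col, p_col):
    """Return {gene: (log2fc, p)} with one entry per gene.

    Rows with a missing gene ID, a non-numeric value, or a p-value
    outside [0, 1] are dropped. For duplicated genes the row with the
    smallest p-value is kept (an assumed aggregation rule).
    """
    missing = {"", "NA", "NaN", "nan"}
    best = {}
    for row in rows:
        gene = (row.get(gene_col) or "").strip()
        if gene in missing:
            continue
        try:
            fc = float(row[fc_col])
            p = float(row[p_col])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(fc) or math.isnan(p) or not 0.0 <= p <= 1.0:
            continue
        if gene not in best or p < best[gene][1]:
            best[gene] = (fc, p)
    return best

# Small tab-separated example with a duplicate, a missing ID and a bad p-value.
raw = (
    "Gene_Symbol\tlog2FoldChange\tpvalue\n"
    "TP53\t2.34\t0.0001\n"
    "TP53\t2.10\t0.02\n"
    "NA\t1.00\t0.5\n"
    "EGFR\t0.89\t1.2\n"
    "BRCA1\t-1.56\t0.023\n"
)
rows = csv.DictReader(io.StringIO(raw), delimiter="\t")
cleaned = clean_rows(rows, "Gene_Symbol", "log2FoldChange", "pvalue")
print(cleaned)  # {'TP53': (2.34, 0.0001), 'BRCA1': (-1.56, 0.023)}
```

If your analysis calls for a different aggregation (for example, averaging log₂FC across duplicates), change the final `if` accordingly.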

Column Auto-Detection

The column detector automatically identifies which columns in your uploaded file contain gene identifiers, fold-change values, and p-values, so you can confirm or correct its choices with full context.

How the detector works

When you upload a file, the detector analyses both column names and the data they contain to identify the most likely role of each column. It assigns a confidence score based on how closely the column's name and values match the expected patterns for each column type.

Gene ID: Column names or values matching gene identifier patterns: gene symbols (e.g. GAPDH), Ensembl IDs (ENSG…), or Entrez integer IDs.
log2FC: Numeric values in a log2-scale range (roughly −10 to +10) with a near-zero mean and both positive and negative values.
p-value: Numeric values that all lie in [0, 1], with a mix of significant and non-significant values.
Adjusted p-value: Same criteria as p-value, plus a column name containing 'adj', 'fdr', 'padj', or 'qvalue'.
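
To make the criteria above concrete, here is a deliberately simplified sketch of content-based role guessing. It is not the detector's actual algorithm: the function `guess_role`, its rules, and its return labels are hypothetical, and it produces no confidence score. It also does not handle Entrez integer IDs, which a purely numeric check cannot tell apart from other numbers.

```python
def guess_role(name, values):
    """Guess a column's role from its name and string values (illustrative only)."""
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except ValueError:
            # Any non-numeric entry: assume the column holds identifiers.
            return "gene_id"
    lowered = name.lower()
    if numeric and all(0.0 <= x <= 1.0 for x in numeric):
        # Values confined to [0, 1] look like p-values; the name decides which kind.
        if any(tag in lowered for tag in ("adj", "fdr", "padj", "qvalue")):
            return "adjusted_p_value"
        return "p_value"
    if any(x < 0 for x in numeric) and any(x > 0 for x in numeric):
        # Both signs present: consistent with log2 fold changes.
        return "log2fc"
    return "unknown"

print(guess_role("Gene_Symbol", ["TP53", "BRCA1"]))
print(guess_role("log2FoldChange", ["2.34", "-1.56"]))
print(guess_role("padj", ["0.005", "0.089"]))
```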

Confidence levels

Level	Threshold	Meaning
High	≥ 80%	Strong match — column name and content both align. Safe to proceed.
Medium	≥ 60%	Probable match — review before running analysis.
Low	≥ 30%	Weak match — override recommended.
Very Low	< 30%	Not detected — manual selection required.
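
Read as code, the table is a simple cascade of lower bounds. The sketch below assumes the score is expressed as a percentage from 0 to 100; the function name `confidence_level` is illustrative.

```python
def confidence_level(score):
    """Map a confidence score (percent, 0-100) to the level names in the table."""
    if score >= 80:
        return "High"
    if score >= 60:
        return "Medium"
    if score >= 30:
        return "Low"
    return "Very Low"

for s in (92, 65, 30, 12):
    print(s, confidence_level(s))
```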

Overriding a detected column

Every column selector shows a dropdown below the detected value. You can pick a different column from your file at any time before running the analysis — the detected choice is a starting point, not a lock. Use the confidence badge and the reasons list shown beneath each selector to decide whether to keep or override the automatic selection.

Statistical Methods

Fisher's Exact Test

The enrichment analysis uses Fisher's exact test to determine if a Key Event gene set is over-represented among your significant genes. For each KE, a 2×2 contingency table is constructed:

|                 | In KE gene set | Not in KE gene set |
|-----------------|----------------|--------------------|
| Significant     | a              | b                  |
| Not significant | c              | d                  |

Fisher's exact test calculates the probability of observing this table, or one more extreme, under the null hypothesis that KE membership and significance are independent.
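
For readers who want to reproduce a p-value by hand, the sketch below computes a one-sided (over-representation) Fisher's exact p-value from the cell counts a, b, c, d of the 2×2 table, using the hypergeometric distribution. Whether the tool reports a one-sided or two-sided p-value is not stated here, so treat the one-sided choice as an assumption.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """P(X >= a) for the hypergeometric null of the 2x2 table [[a, b], [c, d]]."""
    n_sig = a + b            # significant genes
    n_ke = a + c             # genes in the KE gene set
    n_total = a + b + c + d  # background universe
    upper = min(n_sig, n_ke)
    # Sum the probabilities of all tables with at least `a` overlapping genes.
    tail = sum(
        comb(n_ke, k) * comb(n_total - n_ke, n_sig - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_sig)

# 3 of 4 significant genes are in the KE set; 8 genes measured in total.
print(fisher_exact_greater(3, 1, 1, 3))  # 17/70, about 0.2429
```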

False Discovery Rate (FDR) Correction

Because multiple KEs are tested simultaneously, we apply Benjamini-Hochberg FDR correction to control the false discovery rate. This adjusts p-values to account for multiple comparisons, reducing false positives.

Interpretation: If you call every KE with FDR < 0.05 significant, then on average fewer than 5% of those significant KEs are expected to be false positives.
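
The Benjamini-Hochberg adjustment itself is short. The following is a minimal standard-library sketch of the usual step-up procedure; the function name is illustrative, and it is not taken from the tool's source.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input list."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

For the four p-values in the example, the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02.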

Odds Ratio

The odds ratio quantifies the strength of association between KE membership and significance:

OR = 1: No association
OR > 1: Positive association (enrichment)
OR < 1: Negative association (depletion)

For example, OR = 3.5 means the odds of being significant are 3.5 times higher for genes in this KE than for genes outside it.
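
In terms of the 2×2 table cells, the sample odds ratio is (a × d) / (b × c). The sketch below assumes this simple sample estimate; the tool may report a different estimator (for example, a conditional maximum-likelihood estimate), so small numerical differences are possible.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        # Division by zero: infinite when there is any enrichment signal,
        # undefined (NaN) when the numerator is also zero.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

# 7 of 10 KE genes are significant versus 20 of 50 genes outside the KE.
print(odds_ratio(7, 20, 3, 30))  # 3.5
```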

Choosing between Fisher's exact and GSEA

What each test measures. Fisher's exact test asks whether the proportion of significant genes (those passing your log₂FC and p-value thresholds) inside a Key Event's gene set is higher than expected by chance — it is a hard-threshold, over-representation test on a 2×2 contingency table. GSEA asks instead whether the Key Event's gene set is collectively shifted toward the top (up-regulated) or bottom (down-regulated) of the full ranked list, without needing any threshold.

Input requirements. Fisher's exact needs a binary significance call per gene — so it depends on the thresholds you choose below the volcano plot. GSEA needs only a ranking metric per gene; this tool builds it as sign(log₂FC) × −log₁₀(p) so that direction and confidence both contribute. When you pick GSEA, the threshold inputs are hidden because they have no effect on the result.
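
The ranking metric described above, sign(log₂FC) × −log₁₀(p), can be sketched as follows. The `floor` parameter, which avoids taking the logarithm of zero, is an assumption added for the example; how the tool handles p = 0 is not documented here.

```python
import math

def rank_metric(log2fc, p, floor=1e-300):
    """sign(log2FC) * -log10(p); `floor` guards against p == 0 (assumed)."""
    sign = (log2fc > 0) - (log2fc < 0)
    return sign * -math.log10(max(p, floor))

genes = {"TP53": (2.34, 0.0001), "BRCA1": (-1.56, 0.01), "EGFR": (0.89, 0.1)}
# Rank from most confidently up-regulated to most confidently down-regulated.
ranked = sorted(genes, key=lambda g: rank_metric(*genes[g]), reverse=True)
print(ranked)  # ['TP53', 'EGFR', 'BRCA1']
```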

When to use which. Use Fisher's exact when you want a confirmatory answer with explicit thresholds — for example, when a regulator expects a defined significance call. Use GSEA when you want an exploratory, direction-aware view that catches coherent shifts where no individual gene crosses your threshold. Both methods use the same Key Event gene sets, so results are directly comparable.

Caveats. Fisher's exact assumes genes are independent — which is rarely strictly true — and a single highly-expressed gene cannot rescue a non-significant Key Event. GSEA reports a leading-edge subset for each Key Event (the genes that drive the enrichment score); inspect these to confirm the signal is biologically coherent and not driven by one or two outliers. Both methods are FDR-corrected (Benjamini–Hochberg for Fisher; the GSEA permutation procedure produces its own FDR q-values).

Background Universe

The enrichment analysis uses all genes in your uploaded dataset as the background universe. This ensures the statistical test accounts for which genes were actually measured in your experiment.

Important: The background is NOT the entire human genome, but rather the genes you provide in your input file. This makes the test more appropriate for platform-specific data (e.g., RNA-seq, microarray).
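
The effect of the background choice is easiest to see when building the 2×2 table. In this sketch (function and variable names are illustrative), KE genes that are absent from the uploaded file are excluded from every cell, which is one reasonable reading of "the genes you provide are the universe".

```python
def contingency(universe, significant, ke_genes):
    """Return (a, b, c, d) with the uploaded genes as the background universe."""
    universe = set(universe)
    sig = set(significant) & universe
    ke = set(ke_genes) & universe  # KE genes that were not measured are ignored
    a = len(sig & ke)              # significant, in KE
    b = len(sig - ke)              # significant, not in KE
    c = len(ke - sig)              # not significant, in KE
    d = len(universe) - a - b - c  # not significant, not in KE
    return a, b, c, d

measured = ["TP53", "BRCA1", "EGFR", "CYP3A4", "GAPDH"]
print(contingency(measured, ["TP53", "CYP3A4"], ["TP53", "EGFR", "MYC"]))  # (1, 1, 1, 2)
```

MYC belongs to the KE gene set but was not measured, so it does not count towards any cell.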

Interpreting Results

Volcano Plot

The volcano plot visualizes the magnitude (log₂FC) and significance (-log₁₀ p-value) of gene expression changes:

Red points: Significantly upregulated genes (above FC threshold, p < 0.05)
Blue points: Significantly downregulated genes (below -FC threshold, p < 0.05)
Green points: Statistically significant but below FC threshold
Gray points: Not statistically significant

Tip: Adjust the log₂FC threshold using quick options (0, 0.5, 1.0, 1.5, 2.0) or percentage-based thresholds (top 10%, top 20%) to focus on the most extreme expression changes.
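
The colour rules can be summarised as a small classifier. This is a sketch under assumptions: the default threshold of 1.0, the strict inequalities at the boundaries, and the function name are illustrative choices, not documented behaviour.

```python
def volcano_colour(log2fc, p, fc_threshold=1.0, p_cutoff=0.05):
    """Return the point colour for one gene, following the legend."""
    if p >= p_cutoff:
        return "gray"   # not statistically significant
    if log2fc > fc_threshold:
        return "red"    # significantly upregulated
    if log2fc < -fc_threshold:
        return "blue"   # significantly downregulated
    return "green"      # significant, but below the FC threshold

for gene, fc, p in [("TP53", 2.34, 0.0001), ("BRCA1", -1.56, 0.023), ("EGFR", 0.89, 0.045)]:
    print(gene, volcano_colour(fc, p))
```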

Enrichment Table

The enrichment results table shows which Key Events are over-represented in your significant genes:

Key Event Title: Name of the biological process or event
# Overlap: Number of your significant genes associated with this KE
Direction: Observed up/down counts of the overlap genes — e.g. 8↑ / 0↓ means all 8 overlap genes had positive log₂FC. The Fisher test itself is direction-agnostic; this column lets you visually compare the observed direction to the KE's expected direction in its title (e.g. a KE titled "Down Regulation, HMGCS2" should show mostly ↓ in a consistent dataset). Descriptive only — no statistical test, no expected-direction inference; interpretation is yours.
% Enrichment: Percentage of KE genes that are significant in your dataset
P-value: Statistical significance from Fisher's exact test
FDR: False Discovery Rate (adjusted p-value) using Benjamini-Hochberg correction
Odds Ratio: Magnitude of enrichment (>1 indicates over-representation)

Significance threshold: Typically, FDR < 0.05 is considered statistically significant after multiple testing correction.
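
The descriptive columns (# Overlap, Direction, % Enrichment) can be derived from a KE gene set and the significant genes with their log₂FC values. The sketch below is illustrative only: it takes the full KE gene list as the denominator for % Enrichment, which is an assumption, since the tool may instead count only KE genes present in your file.

```python
def summarise_overlap(ke_genes, significant_fc):
    """Summarise one KE; `significant_fc` maps significant genes to log2FC."""
    overlap = [g for g in ke_genes if g in significant_fc]
    up = sum(1 for g in overlap if significant_fc[g] > 0)
    down = sum(1 for g in overlap if significant_fc[g] < 0)
    pct = 100.0 * len(overlap) / len(ke_genes) if ke_genes else 0.0
    return {
        "overlap": len(overlap),
        "direction": f"{up}↑ / {down}↓",
        "pct_enrichment": pct,
    }

ke = ["HMGCS2", "CPT1A", "ACOX1", "PPARA"]
sig = {"HMGCS2": -2.1, "CPT1A": -1.4, "TP53": 2.3}
print(summarise_overlap(ke, sig))
```

Here two of the four KE genes are significant and both are down-regulated, giving an overlap of 2, a direction of 0↑ / 2↓ and 50% enrichment.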

AOP Network Visualization

The interactive network shows how Key Events connect within the selected AOP:

Node colors:
- Light green = Molecular Initiating Event (MIE)
- Light orange = Intermediate Key Event
- Light red = Adverse Outcome (AO)
Node borders:
- Red border = Significantly enriched KE (FDR < 0.05)
- Green border = Significantly affected gene
Gene nodes: Colored by expression (blue = downregulated, red = upregulated)
Edges: Gray lines show KE-KE relationships; thin gray lines show KE-gene associations

Network Controls

+ Add Gene Nodes: Display genes associated with each KE
Toggle Gene Visibility: Show/hide gene nodes
Reset View: Return to original layout and zoom
Download PNG: Export network visualization
Download Network: Export Cytoscape JSON file for further analysis

Looking for a GMT file? The curated KE→gene library used by this analyser (suitable for GSEA, Enrichr, fgsea, clusterProfiler) is served upstream by the molAOP Builder. This analyser produces a complementary per-analysis gene-by-KE CSV (see "Export gene-by-KE (CSV)" on the results page) showing exactly which genes from your dataset drove each KE.

Hub Genes

The Hub genes panel lists genes that are shared across multiple Key Events within the selected AOP. A gene is flagged as a hub when it appears in three or more distinct Key Events — these genes connect several parts of the pathway, so a change in their expression can influence multiple Key Events at once. The panel ranks genes by the number of Key Events they belong to, and the Show gene nodes and Significant genes only toggles control which genes are drawn on the network.
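
The hub rule (three or more distinct Key Events) amounts to counting KE memberships per gene. A minimal sketch, with an illustrative function name and made-up KE labels:

```python
from collections import Counter

def hub_genes(ke_to_genes, min_kes=3):
    """Return (gene, n_KEs) pairs for genes in at least `min_kes` Key Events,
    ranked by the number of Key Events they belong to."""
    counts = Counter(
        gene for genes in ke_to_genes.values() for gene in set(genes)
    )
    hubs = [(g, n) for g, n in counts.items() if n >= min_kes]
    # Most-connected genes first; ties broken alphabetically.
    return sorted(hubs, key=lambda item: (-item[1], item[0]))

aop = {
    "KE 1": ["TP53", "EGFR"],
    "KE 2": ["TP53", "EGFR", "MYC"],
    "KE 3": ["TP53", "MYC"],
    "KE 4": ["TP53", "EGFR"],
}
print(hub_genes(aop))  # [('TP53', 4), ('EGFR', 3)]
```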

Pathway View

The Pathway view embeds the underlying WikiPathways diagram for a Key Event's mapped pathway, so you can inspect the biological pathway behind an enrichment result. Use the pathway picker dropdown to switch between pathways; entries are ordered so the most-enriched Key Events' pathways appear first. The Open full pathway link opens the diagram on wikipathways.org.

Note: The Pathway view shows the bare WikiPathways diagram — your gene expression values are not overlaid on it. For expression-coloured visualisation, use the AOP network above.

Batch Analysis Tutorial

Batch analysis lets you analyse multiple gene expression datasets in a single session, then compare enrichment results across conditions. This is useful for dose–response or time-course experiments.

Step 1: Upload Files

Click the Batch Analysis tab on the home page. You can add files in two ways:

Upload your own: Drag and drop up to 10 CSV/TSV/TXT files onto the drop zone, or click to browse
Use demo datasets: Expand the "Select Cisplatin Demo Datasets" panel and tick the files you want to include

Each uploaded file shows a preview of its first few rows so you can verify the data looks correct.

File requirements: Each file must contain gene identifiers, log₂ fold change values, and p-values — the same format as single analysis. All files should use the same column layout.

Step 2: Tag Conditions

Assign metadata to each file so results can be grouped and compared. For each file you can set:

Condition label: A short name for the experimental condition (e.g., "10 uM", "24 hr")
Timepoint: Exposure duration (e.g., "4hr", "24hr", "72hr")
Dose: Concentration (e.g., "0.1uM", "50uM")

For cisplatin demo files, these fields are auto-filled from the filename.

Step 3: Analysis Settings

Configure shared settings that apply to all files:

AOP selection: Search for an AOP by name or ID using the typeahead search
Gene ID column / FC column / P-value column: Select which columns to use (applied to all files)
log₂FC threshold: Minimum fold change for significance
P-value cutoff: Maximum p-value for significance (default 0.05)
Experiment metadata: Dataset ID, stressor name, owner, and description for reports

Running the Analysis

Click Run Batch Analysis to start. A progress modal shows the status of each file as it is processed. Once complete, you are taken to the batch summary page where you can view individual results or proceed to the comparison view.

Comparison Feature Guide

After completing a batch analysis, use the comparison view to identify patterns across conditions.

Heatmap View

The heatmap displays KE enrichment significance (FDR values) across all analysed conditions. Rows represent Key Events and columns represent conditions. Cells are coloured by significance level:

Darker colours indicate stronger enrichment (lower FDR)
Hover over a cell to see the exact FDR value, overlap count, and odds ratio
Rows and columns can be sorted to highlight patterns

Table View

The comparison table provides a detailed numeric view of enrichment results across conditions. Each row is a Key Event, and you can compare overlap counts, p-values, FDR, and odds ratios side by side.

Network Overlay

The network comparison overlays enrichment results from multiple conditions onto the same AOP network. KE nodes show aggregated significance across the selected conditions, making it easy to see which parts of the pathway are consistently affected.

Delta Mode

Delta mode highlights the differences between two selected conditions. It shows which Key Events become more or less enriched as conditions change (e.g., from low to high dose), helping identify dose–response transitions.
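
One way to think about such a comparison is as a per-KE difference in −log₁₀(FDR) between the two conditions. The sketch below uses that measure purely as an illustration; the quantity that Delta mode actually computes is not specified here, and the function name, the `floor` guard, and the KE labels are hypothetical.

```python
import math

def delta_enrichment(fdr_a, fdr_b, floor=1e-300):
    """Change in -log10(FDR) from condition A to B for KEs present in both.

    Positive values mean the KE is more strongly enriched in condition B.
    """
    shared = sorted(fdr_a.keys() & fdr_b.keys())
    return {
        ke: math.log10(max(fdr_a[ke], floor)) - math.log10(max(fdr_b[ke], floor))
        for ke in shared
    }

low_dose = {"KE 1": 0.1, "KE 2": 0.01}
high_dose = {"KE 1": 0.001, "KE 2": 0.1}
print(delta_enrichment(low_dose, high_dose))
```

In this made-up example, KE 1 gains about two log units of significance at the higher dose, while KE 2 loses about one.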
