Research Tools and Workflow
Setting up the right tools and workflow early saves countless hours during research. This guide covers essential tools for ML research: reference management, writing, experiment tracking, version control, and computational infrastructure.
Reference Management
Organize papers and citations systematically from day one.
Zotero (Recommended)
Why Zotero:
- Free and open source
- Browser extension for one-click paper saving
- Automatic citation generation (BibTeX, APA, etc.)
- PDF annotation and note-taking
- Cloud sync across devices
- Integrates with Word and LaTeX
Installation:
# macOS
brew install --cask zotero
# Linux: not in the default apt repositories; download the tarball
# from https://www.zotero.org/download or install the Flatpak
flatpak install flathub org.zotero.Zotero
# Windows: Download from https://www.zotero.org/

Setup Workflow:
1. Install components:
- Zotero desktop app
- Browser connector extension (Chrome/Firefox)
- Better BibTeX plugin (for LaTeX users)
2. Create collections:
- “Must Read” - High-priority papers
- “Related Work” - Papers to cite
- “Baselines” - Methods to compare against
- “Background” - Foundational papers
- Project-specific collections
3. Workflow:
- Click browser extension while on arXiv, Google Scholar, or journal site
- Paper saved with metadata, PDF attached
- Tag with keywords: “attention”, “multimodal”, “must-cite”
- Add notes and annotations directly in PDF
- Export to BibTeX for LaTeX papers
Better BibTeX Configuration:
{
  "citekeyFormat": "[auth:lower][year]",
  "autoExport": true,
  "exportPath": "~/research/papers.bib"
}

This generates citation keys like vaswani2017 for “Attention Is All You Need”.
Alternative: Mendeley
- Similar features to Zotero
- Owned by Elsevier
- Better integration with Word
- Good mobile app
Paper Reading Workflow
Combine Zotero with the three-pass reading method:
- First pass: Save to Zotero, add to “To Read” collection
- Second pass: Annotate PDF, take notes in Zotero
- Third pass: Move to “Must Cite” or “Background” collection
LaTeX for Academic Writing
LaTeX is the standard for ML/AI papers and theses.
Why LaTeX?
- Professional typesetting - Beautiful equations and formatting
- Version control friendly - Plain text, works with Git
- Reference management - Automatic bibliography with BibTeX
- Conference templates - Required by most ML venues
- Reproducibility - Same output on any system
Installation Options
Option 1: Overleaf (Recommended for Beginners)
- Online LaTeX editor (no installation needed)
- Real-time collaboration (like Google Docs)
- Built-in templates for conferences
- Automatic compilation
- Version history and Git integration
- Free tier sufficient for most users
Sign up at overleaf.com
Option 2: Local Installation
# macOS - Full TeX distribution (~4GB)
brew install --cask mactex
# Linux - Full texlive
sudo apt-get install texlive-full
# Windows - MiKTeX from https://miktex.org/

Recommended Local Editor: VS Code with LaTeX Workshop extension
# Install VS Code
brew install --cask visual-studio-code
# Install LaTeX Workshop extension
code --install-extension James-Yu.latex-workshop

Basic Paper Template
Create paper.tex:
\documentclass{article}
% Essential packages
\usepackage{amsmath, amssymb} % Math symbols
\usepackage{graphicx} % Include figures
\usepackage{hyperref} % Hyperlinks
\usepackage{booktabs} % Professional tables
\usepackage{algorithm} % Algorithms
\usepackage{algorithmic}
% Document metadata
\title{Your Paper Title}
\author{Your Name \\ Your Institution}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
Your abstract goes here.
\end{abstract}
\section{Introduction}
Introduction content...
\section{Related Work}
Related work...
\section{Method}
Your method...
\section{Experiments}
Results and analysis...
\section{Conclusion}
Conclusions...
\bibliographystyle{plain}
\bibliography{references} % references.bib file
\end{document}

Compile:
pdflatex paper.tex
bibtex paper
pdflatex paper.tex
pdflatex paper.tex # run pdflatex twice so references resolve

Conference Templates
Most conferences provide LaTeX templates:
NeurIPS:
# Download from https://neurips.cc/Conferences/2024/PaperInformation/StyleFiles
wget https://media.neurips.cc/Conferences/NeurIPS2024/Styles/neurips_2024.zip
unzip neurips_2024.zip

ICML/ICLR: Similar templates are available on the conference websites.
Use Overleaf templates:
- New Project → Templates → Search for “NeurIPS” or “ICML”
Essential LaTeX Commands
Math equations:
% Inline math
The loss function is $\mathcal{L} = -\log p(y|x)$.
% Display math
\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

Figures:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{architecture.pdf}
\caption{Model architecture.}
\label{fig:architecture}
\end{figure}
% Reference: See Figure~\ref{fig:architecture}

Tables:
\begin{table}[t]
\centering
\caption{Results on benchmark datasets.}
\label{tab:results}
\begin{tabular}{lcc}
\toprule
Method & Accuracy & F1 Score \\
\midrule
Baseline & 85.3 & 83.1 \\
Our Method & \textbf{92.7} & \textbf{91.4} \\
\bottomrule
\end{tabular}
\end{table}

Citations:
% In text
Transformers \cite{vaswani2017} revolutionized NLP.
% Multiple citations
Recent work \cite{vaswani2017, devlin2018, brown2020} has shown...

Experiment Tracking
Track ML experiments systematically to avoid losing results.
Weights & Biases (Recommended)
Why W&B:
- Automatic experiment logging
- Hyperparameter tracking
- Real-time metric visualization
- Model versioning
- Collaborative experiment tracking
- Free for academic use
Installation:
pip install wandb

Basic Usage:
import wandb

# Initialize experiment
wandb.init(
    project="my-thesis",
    name="transformer-baseline",
    config={
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 100
    }
)

# Training loop (model, loaders, train_epoch, and validate
# are assumed to be defined elsewhere in your project)
for epoch in range(epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    # Log metrics
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "epoch": epoch
    })

# Save model checkpoint to the run
wandb.save("model.pt")

Features:
- Automatic hyperparameter tracking
- Compare across runs (see the API sketch below)
- Generate reports for papers
- Track system metrics (GPU, CPU, memory)
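Runs logged this way can also be pulled back programmatically, for example to build a results table for a paper. A minimal sketch using W&B's public API; the my-entity/my-thesis path is a placeholder for your own entity and project:

import wandb

# Query every run in a project via the public API.
api = wandb.Api()
for run in api.runs("my-entity/my-thesis"):  # placeholder path
    print(run.name, run.config.get("learning_rate"), run.summary.get("val_loss"))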
Alternative: MLflow
pip install mlflow

Usage:
import mlflow
mlflow.start_run()
mlflow.log_param("learning_rate", 1e-4)
mlflow.log_metric("accuracy", 0.92)
mlflow.end_run()
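The same run can also be written with a context manager, which closes the run even if training crashes partway through; a small equivalent sketch (the run name is illustrative):

import mlflow

# The context manager calls mlflow.end_run() automatically,
# even if an exception is raised inside the block.
with mlflow.start_run(run_name="transformer-baseline"):
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_metric("accuracy", 0.92)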
Manual Experiment Tracking

If not using tracking tools, maintain a structured log:
Create experiments.md:
# Experiment Log
## Experiment 1: Baseline Transformer
**Date**: 2025-11-11
**Goal**: Establish baseline performance
**Config**:
- Model: Transformer (6 layers, 512 dim)
- LR: 1e-4
- Batch size: 32
- Dataset: IMDB
**Results**:
- Test Accuracy: 85.3%
- Training time: 2.5 hours
- GPU: A100
**Notes**: Overfitting after epoch 50, try dropout
## Experiment 2: Add Dropout
**Date**: 2025-11-12
[...]
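If you keep such a log by hand, a small helper keeps entries uniform. A sketch following the field layout above (the log_experiment helper is a hypothetical convenience, not a library function):

from datetime import date

def log_experiment(title, goal, config, results, notes, path="experiments.md"):
    """Append a consistently formatted entry to the experiment log."""
    with open(path, "a") as f:
        f.write(f"\n## {title}\n")
        f.write(f"**Date**: {date.today().isoformat()}\n")
        f.write(f"**Goal**: {goal}\n")
        f.write("**Config**:\n")
        for key, value in config.items():
            f.write(f"- {key}: {value}\n")
        f.write("**Results**:\n")
        for key, value in results.items():
            f.write(f"- {key}: {value}\n")
        f.write(f"**Notes**: {notes}\n")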
Version Control with Git

Track code changes, collaborate, and never lose work.
Git Basics
Installation:
# macOS
brew install git
# Linux
sudo apt-get install git

Initialize repository:
cd my-thesis
git init
git add .
git commit -m "Initial commit"Daily workflow:
# See what changed
git status
git diff
# Commit changes
git add file1.py file2.py
git commit -m "Add attention mechanism"
# View history
git log --oneline

GitHub for Collaboration
Create repository:
- Go to github.com, create new repository
- Link local repository:
git remote add origin https://github.com/username/my-thesis.git
git push -u origin main

Collaboration workflow:
# Create feature branch
git checkout -b add-bert-baseline
# Make changes, commit
git add bert.py
git commit -m "Add BERT baseline"
# Push and create pull request
git push origin add-bert-baseline

.gitignore for ML Projects
Create .gitignore:
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/
# Data (too large for Git)
data/
*.csv
*.hdf5
# Models (use model versioning instead)
models/
checkpoints/
*.pt
*.pth
*.h5
# Logs
logs/
wandb/
# Environment
venv/
.env
# OS
.DS_Store

Git Large File Storage (Git LFS)
For tracking large model files:
# Install (macOS shown; Debian/Ubuntu: sudo apt-get install git-lfs)
brew install git-lfs
git lfs install
# Track large files
git lfs track "*.pt"
git add .gitattributes
git commit -m "Track model files with LFS"Code Organization
Structure projects for reproducibility and collaboration.
Project Structure
my-thesis/
├── README.md # Project overview
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment (if using)
├── .gitignore
├── data/
│ ├── raw/ # Original data (never modify)
│ ├── processed/ # Cleaned data
│ └── README.md # Data documentation
├── src/
│ ├── __init__.py
│ ├── data.py # Data loading
│ ├── models.py # Model architectures
│ ├── train.py # Training loop
│ ├── eval.py # Evaluation
│ └── utils.py # Utilities
├── scripts/
│ ├── preprocess.py # Data preprocessing
│ ├── train.sh # Training script
│ └── eval.sh # Evaluation script
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory analysis
│ └── 02_viz.ipynb # Visualization
├── configs/
│ ├── baseline.yaml # Baseline config
│ └── best_model.yaml # Best model config
├── tests/
│ └── test_model.py # Unit tests
└── paper/
├── paper.tex # LaTeX source
├── references.bib # Bibliography
└── figures/ # Paper figures

Configuration Files
Use YAML for configs:
configs/transformer.yaml:
model:
  type: transformer
  num_layers: 6
  d_model: 512
  num_heads: 8
  dropout: 0.1

training:
  batch_size: 32
  learning_rate: 1.0e-4  # PyYAML needs the decimal point to parse this as a float
  epochs: 100
  warmup_steps: 1000

data:
  train_path: data/processed/train.csv
  val_path: data/processed/val.csv

Load in code:
import yaml

with open('configs/transformer.yaml') as f:
    config = yaml.safe_load(f)

model = Transformer(**config['model'])
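A config file also makes it easy to run variations without editing code. One common pattern is to allow command-line overrides on top of the YAML; a minimal sketch, assuming the configs/transformer.yaml layout above (the --set flag and dotted-path syntax are just one possible convention, not a standard-library feature):

import argparse

import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="configs/transformer.yaml")
parser.add_argument("--set", dest="overrides", nargs="*", default=[],
                    help="Dotted overrides, e.g. training.batch_size=64")
args = parser.parse_args()

with open(args.config) as f:
    config = yaml.safe_load(f)

# Walk each dotted path down the nested dict and replace the leaf value.
for override in args.overrides:
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[leaf] = yaml.safe_load(raw_value)  # reuse YAML typing: "64" -> int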
Reproducibility Checklist
- Pin dependencies: `pip freeze > requirements.txt`
- Set random seeds: `torch.manual_seed(42)` (see the fuller sketch after this list)
- Document data preprocessing: How was the data cleaned?
- Save model configs: YAML files for each experiment
- Track hyperparameters: W&B or manual logs
- Version control code: Git with meaningful commits
- README with instructions: How to reproduce results
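A single torch.manual_seed call covers only PyTorch's generator; Python's random module, NumPy, and CUDA each keep their own state. A minimal helper, assuming PyTorch and NumPy are the only sources of randomness in your pipeline:

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every RNG the training pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

Note that some GPU kernels are still nondeterministic; torch.use_deterministic_algorithms(True) tightens this further, usually at some performance cost.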
Computational Infrastructure
Local Development
Recommended Setup:
- GPU: NVIDIA RTX 3090/4090 for development (24GB VRAM)
- RAM: 32GB+ for data processing
- Storage: 1TB+ SSD for datasets and models
Environment Setup:
# Create conda environment
conda create -n thesis python=3.10
conda activate thesis
# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Install other dependencies
pip install transformers datasets wandb numpy pandas matplotlib
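Before launching long jobs, confirm that PyTorch actually sees the GPU; a quick sanity check:

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))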
Cloud Computing

Google Colab (Free/Pro):
- Free: T4 GPU (16GB), time limits
- Pro ($10/month): Better GPUs, longer runtime
- Good for prototyping, not production training
Lambda Labs (Recommended for GPU rentals):
- A100 (40GB): ~$1.10/hour
- H100 (80GB): ~$2/hour (prices change frequently; check current rates)
- No complex setup, pay-as-you-go
AWS/GCP/Azure:
- More complex, but scalable
- Good for large-scale experiments
- Consider credits for students (AWS Educate, GCP Education Grants)
Remote Development
SSH into server:
# Connect to server
ssh user@server.university.edu
# Run training in background with tmux
tmux new -s training
python train.py
# Detach: Ctrl+B, then D
# Reattach later: tmux attach -t training

VS Code Remote Development:
Install “Remote - SSH” extension, connect to server, edit code as if local.
Jupyter Notebooks
For exploration and visualization.
Installation:
pip install jupyter
jupyter notebook

Best Practices:
- Use for exploration only - Don’t put training loops in notebooks
- Export to scripts - Convert final code to `.py` files
- Clear outputs before committing - Avoid large Git diffs (see the sketch after this list)
- Name cells - Use markdown headers to organize
- Restart kernel regularly - Ensure reproducibility
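Tools like nbstripout can clear outputs automatically on commit, but the operation itself is simple enough to sketch with nbformat, assuming the notebook filename (this rewrites the file in place):

import nbformat

# Strip outputs and execution counts so notebook diffs stay readable.
nb = nbformat.read("notebook.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "notebook.ipynb")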
Convert notebook to script:
jupyter nbconvert --to script notebook.ipynb

Paper Writing Workflow
Combine tools for efficient writing:
- Organize papers: Zotero collections
- Draft in LaTeX: Overleaf or local editor
- Manage references: Zotero → BibTeX export
- Generate figures: Matplotlib/Seaborn → PDF
- Track versions: Git for LaTeX source
- Collaborate: Overleaf sharing or Git branches
Example Figure Generation:
import matplotlib.pyplot as plt
import seaborn as sns

# Print-friendly styling and a colorblind-safe palette
plt.style.use('seaborn-v0_8-paper')
sns.set_palette("colorblind")

# epochs, train_loss, and val_loss are assumed to be arrays
# collected during training
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(epochs, train_loss, label='Train')
ax.plot(epochs, val_loss, label='Validation')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.legend()

plt.tight_layout()
plt.savefig('paper/figures/training_curve.pdf', dpi=300, bbox_inches='tight')

Productivity Tips
Time Management
- Pomodoro Technique: 25 min focused work, 5 min break
- Deep Work Blocks: 2-4 hour uninterrupted periods for coding/writing
- Meeting-Free Days: Dedicate specific days for deep research
Documentation
- Document as you go: Don’t wait until the end
- README for every project: Explain setup and usage
- Code comments: Explain why, not what
- Lab notebook: Daily progress log
Backup Strategy
- 3-2-1 Rule:
- 3 copies of data
- 2 different storage types (local + cloud)
- 1 off-site backup
Automated Backup:
# Sync code to cloud
rclone sync ~/research/ gdrive:research/ --exclude "data/" --exclude "models/"
# Schedule with cron (daily at 2am)
0 2 * * * rclone sync ~/research/ gdrive:research/ --exclude "data/" --exclude "models/"

Tool Summary
| Purpose | Recommended Tool | Alternative |
|---|---|---|
| Reference Management | Zotero | Mendeley |
| Paper Writing | Overleaf | Local LaTeX |
| Experiment Tracking | Weights & Biases | MLflow |
| Version Control | Git + GitHub | GitLab |
| Code Editor | VS Code | PyCharm |
| Notebooks | JupyterLab | Google Colab |
| Cloud GPU | Lambda Labs | Google Colab Pro |
| Collaboration | Slack/Discord | |
| Diagramming | draw.io | PowerPoint |
Getting Started Checklist
Set up your research environment:
- Install Zotero + browser extension
- Create Overleaf account
- Set up Git and GitHub
- Create Weights & Biases account
- Install Python/PyTorch environment
- Structure project directory
- Write initial README
- Set up backup system
- Join research community (Slack/Discord)
- Schedule regular supervisor meetings
Resources
Tool Documentation
- Zotero: zotero.org/support
- Overleaf: overleaf.com/learn
- Git: git-scm.com/book
- W&B: docs.wandb.ai
Learning Resources
- LaTeX: Overleaf LaTeX Tutorials
- Git: GitHub Git Handbook
- Python Best Practices: The Hitchhiker’s Guide to Python
Related Content
- Publication Strategy - Navigate the publication process
- Structuring Research Papers - Write effective papers
- Research Methodology Path - Complete research workflow
Summary
Key Takeaways:
- Set up tools early: Don’t wait until you need them
- Automate everything: Experiment tracking, backups, figure generation
- Version control religiously: Git for code, configs, and papers
- Document continuously: README files, code comments, lab notebook
- Organize systematically: Consistent project structure, naming conventions
- Backup redundantly: Multiple copies in different locations
Good tools and workflow multiply your research productivity. Invest time upfront to set them up properly, and you’ll save hundreds of hours over the course of your thesis or research career.