Research Tools and Workflow
Setting up the right tools and workflow early saves countless hours during research. This guide covers essential tools for ML research: reference management, writing, experiment tracking, version control, and computational infrastructure.
Reference Management
Organize papers and citations systematically from day one.
Zotero (Recommended)
Why Zotero:
- Free and open source
- Browser extension for one-click paper saving
- Automatic citation generation (BibTeX, APA, etc.)
- PDF annotation and note-taking
- Cloud sync across devices
- Integrates with Word and LaTeX
Installation:
# macOS
brew install --cask zotero
# Linux: not in the default apt repositories; download the tarball
# from https://www.zotero.org/download or install the Flatpak
flatpak install flathub org.zotero.Zotero
# Windows: Download from https://www.zotero.org/

Setup Workflow:
1. Install components:
- Zotero desktop app
- Browser connector extension (Chrome/Firefox)
- Better BibTeX plugin (for LaTeX users)
2. Create collections:
- “Must Read” - High-priority papers
- “Related Work” - Papers to cite
- “Baselines” - Methods to compare against
- “Background” - Foundational papers
- Project-specific collections
3. Workflow:
- Click browser extension while on arXiv, Google Scholar, or journal site
- Paper saved with metadata, PDF attached
- Tag with keywords: “attention”, “multimodal”, “must-cite”
- Add notes and annotations directly in PDF
- Export to BibTeX for LaTeX papers
Better BibTeX Configuration:
{
  "citekeyFormat": "[auth:lower][year]",
  "autoExport": true,
  "exportPath": "~/research/papers.bib"
}

This generates citation keys like vaswani2017 for “Attention Is All You Need”.
Alternative: Mendeley
- Similar features to Zotero
- Owned by Elsevier
- Better integration with Word
- Good mobile app
Paper Reading Workflow
Combine Zotero with the three-pass reading method:
- First pass: Save to Zotero, add to “To Read” collection
- Second pass: Annotate PDF, take notes in Zotero
- Third pass: Move to “Must Cite” or “Background” collection
LaTeX for Academic Writing
LaTeX is the standard for ML/AI papers and theses.
Why LaTeX?
- Professional typesetting - Beautiful equations and formatting
- Version control friendly - Plain text, works with Git
- Reference management - Automatic bibliography with BibTeX
- Conference templates - Required by most ML venues
- Reproducibility - Same output on any system
Installation Options
Option 1: Overleaf (Recommended for Beginners)
- Online LaTeX editor (no installation needed)
- Real-time collaboration (like Google Docs)
- Built-in templates for conferences
- Automatic compilation
- Version history and Git integration
- Free tier sufficient for most users
Sign up at overleaf.com
Option 2: Local Installation
# macOS - Full TeX distribution (~4GB)
brew install --cask mactex
# Linux - Full texlive
sudo apt-get install texlive-full
# Windows - MiKTeX from https://miktex.org/

Recommended Local Editor: VS Code with LaTeX Workshop extension
# Install VS Code
brew install --cask visual-studio-code
# Install LaTeX Workshop extension
code --install-extension James-Yu.latex-workshop

Basic Paper Template
Create paper.tex:
\documentclass{article}
% Essential packages
\usepackage{amsmath, amssymb} % Math symbols
\usepackage{graphicx} % Include figures
\usepackage{hyperref} % Hyperlinks
\usepackage{booktabs} % Professional tables
\usepackage{algorithm} % Algorithms
\usepackage{algorithmic}
% Document metadata
\title{Your Paper Title}
\author{Your Name \\ Your Institution}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
Your abstract goes here.
\end{abstract}
\section{Introduction}
Introduction content...
\section{Related Work}
Related work...
\section{Method}
Your method...
\section{Experiments}
Results and analysis...
\section{Conclusion}
Conclusions...
\bibliographystyle{plain}
\bibliography{references} % references.bib file
\end{document}

Compile:
pdflatex paper.tex
bibtex paper
pdflatex paper.tex
pdflatex paper.tex # run pdflatex twice so references resolve

Conference Templates
Most conferences provide LaTeX templates:
NeurIPS:
# Download from https://neurips.cc/Conferences/2024/PaperInformation/StyleFiles
wget https://media.neurips.cc/Conferences/NeurIPS2024/Styles/neurips_2024.zip
unzip neurips_2024.zip

ICML/ICLR: Similar templates are available on the conference websites.
Use Overleaf templates:
- New Project → Templates → Search for “NeurIPS” or “ICML”
Essential LaTeX Commands
Math equations:
% Inline math
The loss function is $\mathcal{L} = -\log p(y|x)$.
% Display math
\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

Figures:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{architecture.pdf}
\caption{Model architecture.}
\label{fig:architecture}
\end{figure}
% Reference: See Figure~\ref{fig:architecture}

Tables:
\begin{table}[t]
\centering
\caption{Results on benchmark datasets.}
\label{tab:results}
\begin{tabular}{lcc}
\toprule
Method & Accuracy & F1 Score \\
\midrule
Baseline & 85.3 & 83.1 \\
Our Method & \textbf{92.7} & \textbf{91.4} \\
\bottomrule
\end{tabular}
\end{table}

Citations:
% In text
Transformers \cite{vaswani2017} revolutionized NLP.
% Multiple citations
Recent work \cite{vaswani2017, devlin2018, brown2020} has shown...

Experiment Tracking
Track ML experiments systematically to avoid losing results.
Weights & Biases (Recommended)
Why W&B:
- Automatic experiment logging
- Hyperparameter tracking
- Real-time metric visualization
- Model versioning
- Collaborative experiment tracking
- Free for academic use
Installation:
pip install wandb

Basic Usage:
import wandb

# Initialize experiment
wandb.init(
    project="my-thesis",
    name="transformer-baseline",
    config={
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 100
    }
)

# Training loop (model, loaders, train_epoch, and validate
# are assumed to be defined elsewhere in your project)
for epoch in range(epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    # Log metrics
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "epoch": epoch
    })

# Save model checkpoint to the run
wandb.save("model.pt")

Features:
- Automatic hyperparameter tracking
- Compare across runs (see the API sketch below)
- Generate reports for papers
- Track system metrics (GPU, CPU, memory)
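Runs logged this way can also be pulled back programmatically, for example to build a results table for a paper. A minimal sketch using W&B's public API; the my-entity/my-thesis path is a placeholder for your own entity and project:

import wandb

# Query every run in a project via the public API.
api = wandb.Api()
for run in api.runs("my-entity/my-thesis"):  # placeholder path
    print(run.name, run.config.get("learning_rate"), run.summary.get("val_loss"))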
Alternative: MLflow
pip install mlflow

Usage:
import mlflow
mlflow.start_run()
mlflow.log_param("learning_rate", 1e-4)
mlflow.log_metric("accuracy", 0.92)
mlflow.end_run()
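The same run can also be written with a context manager, which closes the run even if training crashes partway through; a small equivalent sketch (the run name is illustrative):

import mlflow

# The context manager calls mlflow.end_run() automatically,
# even if an exception is raised inside the block.
with mlflow.start_run(run_name="transformer-baseline"):
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_metric("accuracy", 0.92)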
Manual Experiment Tracking

If not using tracking tools, maintain a structured log:
Create experiments.md:
# Experiment Log
## Experiment 1: Baseline Transformer
**Date**: 2025-11-11
**Goal**: Establish baseline performance
**Config**:
- Model: Transformer (6 layers, 512 dim)
- LR: 1e-4
- Batch size: 32
- Dataset: IMDB
**Results**:
- Test Accuracy: 85.3%
- Training time: 2.5 hours
- GPU: A100
**Notes**: Overfitting after epoch 50, try dropout
## Experiment 2: Add Dropout
**Date**: 2025-11-12
[...]
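If you keep such a log by hand, a small helper keeps entries uniform. A sketch following the field layout above (the log_experiment helper is a hypothetical convenience, not a library function):

from datetime import date

def log_experiment(title, goal, config, results, notes, path="experiments.md"):
    """Append a consistently formatted entry to the experiment log."""
    with open(path, "a") as f:
        f.write(f"\n## {title}\n")
        f.write(f"**Date**: {date.today().isoformat()}\n")
        f.write(f"**Goal**: {goal}\n")
        f.write("**Config**:\n")
        for key, value in config.items():
            f.write(f"- {key}: {value}\n")
        f.write("**Results**:\n")
        for key, value in results.items():
            f.write(f"- {key}: {value}\n")
        f.write(f"**Notes**: {notes}\n")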
Version Control with Git

Track code changes, collaborate, and never lose work.
Git Basics
Installation:
# macOS
brew install git
# Linux
sudo apt-get install git

Initialize repository:
cd my-thesis
git init
git add .
git commit -m "Initial commit"Daily workflow:
# See what changed
git status
git diff
# Commit changes
git add file1.py file2.py
git commit -m "Add attention mechanism"
# View history
git log --oneline

GitHub for Collaboration
Create repository:
- Go to github.com, create new repository
- Link local repository:
git remote add origin https://github.com/username/my-thesis.git
git push -u origin main

Collaboration workflow:
# Create feature branch
git checkout -b add-bert-baseline
# Make changes, commit
git add bert.py
git commit -m "Add BERT baseline"
# Push and create pull request
git push origin add-bert-baseline

.gitignore for ML Projects
Create .gitignore:
# Python
__pycache__/
*.pyc
.ipynb_checkpoints/
# Data (too large for Git)
data/
*.csv
*.hdf5
# Models (use model versioning instead)
models/
checkpoints/
*.pt
*.pth
*.h5
# Logs
logs/
wandb/
# Environment
venv/
.env
# OS
.DS_Store

Git Large File Storage (Git LFS)
For tracking large model files:
# Install (macOS shown; Debian/Ubuntu: sudo apt-get install git-lfs)
brew install git-lfs
git lfs install
# Track large files
git lfs track "*.pt"
git add .gitattributes
git commit -m "Track model files with LFS"Code Organization
Structure projects for reproducibility and collaboration.
Project Structure
my-thesis/
├── README.md # Project overview
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment (if using)
├── .gitignore
├── data/
│ ├── raw/ # Original data (never modify)
│ ├── processed/ # Cleaned data
│ └── README.md # Data documentation
├── src/
│ ├── __init__.py
│ ├── data.py # Data loading
│ ├── models.py # Model architectures
│ ├── train.py # Training loop
│ ├── eval.py # Evaluation
│ └── utils.py # Utilities
├── scripts/
│ ├── preprocess.py # Data preprocessing
│ ├── train.sh # Training script
│ └── eval.sh # Evaluation script
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory analysis
│ └── 02_viz.ipynb # Visualization
├── configs/
│ ├── baseline.yaml # Baseline config
│ └── best_model.yaml # Best model config
├── tests/
│ └── test_model.py # Unit tests
└── paper/
├── paper.tex # LaTeX source
├── references.bib # Bibliography
└── figures/ # Paper figures

Configuration Files
Use YAML for configs:
configs/transformer.yaml:
model:
  type: transformer
  num_layers: 6
  d_model: 512
  num_heads: 8
  dropout: 0.1

training:
  batch_size: 32
  learning_rate: 1.0e-4  # PyYAML needs the decimal point to parse this as a float
  epochs: 100
  warmup_steps: 1000

data:
  train_path: data/processed/train.csv
  val_path: data/processed/val.csv

Load in code:
import yaml

with open('configs/transformer.yaml') as f:
    config = yaml.safe_load(f)

model = Transformer(**config['model'])
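A config file also makes it easy to run variations without editing code. One common pattern is to allow command-line overrides on top of the YAML; a minimal sketch, assuming the configs/transformer.yaml layout above (the --set flag and dotted-path syntax are just one possible convention, not a standard-library feature):

import argparse

import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="configs/transformer.yaml")
parser.add_argument("--set", dest="overrides", nargs="*", default=[],
                    help="Dotted overrides, e.g. training.batch_size=64")
args = parser.parse_args()

with open(args.config) as f:
    config = yaml.safe_load(f)

# Walk each dotted path down the nested dict and replace the leaf value.
for override in args.overrides:
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[leaf] = yaml.safe_load(raw_value)  # reuse YAML typing: "64" -> int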
Reproducibility Checklist
- Pin dependencies: `pip freeze > requirements.txt`
- Set random seeds: `torch.manual_seed(42)` (see the fuller sketch after this list)
- Document data preprocessing: How was the data cleaned?
- Save model configs: YAML files for each experiment
- Track hyperparameters: W&B or manual logs
- Version control code: Git with meaningful commits
- README with instructions: How to reproduce results
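A single torch.manual_seed call covers only PyTorch's generator; Python's random module, NumPy, and CUDA each keep their own state. A minimal helper, assuming PyTorch and NumPy are the only sources of randomness in your pipeline:

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every RNG the training pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

Note that some GPU kernels are still nondeterministic; torch.use_deterministic_algorithms(True) tightens this further, usually at some performance cost.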
Computational Infrastructure
Local Development
Recommended Setup:
- GPU: NVIDIA RTX 3090/4090 for development (24GB VRAM)
- RAM: 32GB+ for data processing
- Storage: 1TB+ SSD for datasets and models
Environment Setup:
# Create conda environment
conda create -n thesis python=3.10
conda activate thesis
# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Install other dependencies
pip install transformers datasets wandb numpy pandas matplotlib
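Before launching long jobs, confirm that PyTorch actually sees the GPU; a quick sanity check:

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))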
Cloud Computing

Google Colab (Free/Pro):
- Free: T4 GPU (16GB), time limits
- Pro ($10/month): Better GPUs, longer runtime
- Good for prototyping, not production training
Lambda Labs (Recommended for GPU rentals):
- A100 (40GB): ~$1.10/hour
- H100 (80GB): ~$2/hour (prices change frequently; check current rates)
- No complex setup, pay-as-you-go
AWS/GCP/Azure:
- More complex, but scalable
- Good for large-scale experiments
- Consider credits for students (AWS Educate, GCP Education Grants)
Remote Development
SSH into server:
# Connect to server
ssh user@server.university.edu
# Run training in background with tmux
tmux new -s training
python train.py
# Detach: Ctrl+B, then D
# Reattach later: tmux attach -t training

VS Code Remote Development:
Install “Remote - SSH” extension, connect to server, edit code as if local.
Jupyter Notebooks
For exploration and visualization.
Installation:
pip install jupyter
jupyter notebook

Best Practices:
- Use for exploration only - Don’t put training loops in notebooks
- Export to scripts - Convert final code to `.py` files
- Clear outputs before committing - Avoid large Git diffs (see the sketch after this list)
- Name cells - Use markdown headers to organize
- Restart kernel regularly - Ensure reproducibility
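Tools like nbstripout can clear outputs automatically on commit, but the operation itself is simple enough to sketch with nbformat, assuming the notebook filename (this rewrites the file in place):

import nbformat

# Strip outputs and execution counts so notebook diffs stay readable.
nb = nbformat.read("notebook.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "notebook.ipynb")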
Convert notebook to script:
jupyter nbconvert --to script notebook.ipynb

Paper Writing Workflow
Combine tools for efficient writing:
- Organize papers: Zotero collections
- Draft in LaTeX: Overleaf or local editor
- Manage references: Zotero → BibTeX export
- Generate figures: Matplotlib/Seaborn → PDF
- Track versions: Git for LaTeX source
- Collaborate: Overleaf sharing or Git branches
Example Figure Generation:
import matplotlib.pyplot as plt
import seaborn as sns

# Print-friendly styling and a colorblind-safe palette
plt.style.use('seaborn-v0_8-paper')
sns.set_palette("colorblind")

# epochs, train_loss, and val_loss are assumed to be arrays
# collected during training
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(epochs, train_loss, label='Train')
ax.plot(epochs, val_loss, label='Validation')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.legend()

plt.tight_layout()
plt.savefig('paper/figures/training_curve.pdf', dpi=300, bbox_inches='tight')

Productivity Tips
Time Management
- Pomodoro Technique: 25 min focused work, 5 min break
- Deep Work Blocks: 2-4 hour uninterrupted periods for coding/writing
- Meeting-Free Days: Dedicate specific days for deep research
Documentation
- Document as you go: Don’t wait until the end
- README for every project: Explain setup and usage
- Code comments: Explain why, not what
- Lab notebook: Daily progress log
Backup Strategy
- 3-2-1 Rule:
- 3 copies of data
- 2 different storage types (local + cloud)
- 1 off-site backup
Automated Backup:
# Sync code to cloud
rclone sync ~/research/ gdrive:research/ --exclude "data/" --exclude "models/"
# Schedule with cron (daily at 2am)
0 2 * * * rclone sync ~/research/ gdrive:research/ --exclude "data/" --exclude "models/"

Tool Summary
| Purpose | Recommended Tool | Alternative |
|---|---|---|
| Reference Management | Zotero | Mendeley |
| Paper Writing | Overleaf | Local LaTeX |
| Experiment Tracking | Weights & Biases | MLflow |
| Version Control | Git + GitHub | GitLab |
| Code Editor | VS Code | PyCharm |
| Notebooks | JupyterLab | Google Colab |
| Cloud GPU | Lambda Labs | Google Colab Pro |
| Collaboration | Slack/Discord | |
| Diagramming | draw.io | PowerPoint |
Getting Started Checklist
Set up your research environment:
- Install Zotero + browser extension
- Create Overleaf account
- Set up Git and GitHub
- Create Weights & Biases account
- Install Python/PyTorch environment
- Structure project directory
- Write initial README
- Set up backup system
- Join research community (Slack/Discord)
- Schedule regular supervisor meetings
Resources
Tool Documentation
- Zotero: zotero.org/support
- Overleaf: overleaf.com/learn
- Git: git-scm.com/book
- W&B: docs.wandb.ai
Learning Resources
- LaTeX: Overleaf LaTeX Tutorials
- Git: GitHub Git Handbook
- Python Best Practices: The Hitchhiker’s Guide to Python
Related Content
- Publication Strategy - Navigate the publication process
- Structuring Research Papers - Write effective papers
- Research Methodology Path - Complete research workflow
Summary
Key Takeaways:
- Set up tools early: Don’t wait until you need them
- Automate everything: Experiment tracking, backups, figure generation
- Version control religiously: Git for code, configs, and papers
- Document continuously: README files, code comments, lab notebook
- Organize systematically: Consistent project structure, naming conventions
- Backup redundantly: Multiple copies in different locations
Good tools and workflow multiply your research productivity. Invest time upfront to set them up properly, and you’ll save hundreds of hours over the course of your thesis or research career.