Why we're hiring a Lead Data Engineer
We're building the expert intelligence layer for scientific research: a knowledge graph that connects the world to leading experts based on publications & clinical trials in precise ontologies. You'll design pipelines that ingest millions of life-science records, shaping how scientific knowledge is modelled, enriched, & served.
This is true greenfield work. Your decisions will lay the data foundations for our entire expert intelligence platform.
What You'll Do
You will be working at the intersection of science, data engineering & AI to build expert intelligence.
- Own data end-to-end: design & run data pipelines turning millions of scientific records into a knowledge graph.
- Implement precision entity resolution & enrichment: disambiguate & enrich experts from noisy data sources.
- Utilise LLM workflows where they make sense: for entity extraction, relationship inference & quality validation.
- Develop vector embeddings & semantic search capabilities to power expert discovery & similarity matching.
- Model life-science entities & relationships: ontologies, author networks, publication & clinical trial metadata.
- Build graph & vector data access: performant, accessible, reliable, observable & testable.
- Move fast & ship value incrementally: done-and-iterating beats perfect-and-pending.
- Radiate intent & document your thinking openly, collaborating async-first in a hybrid environment.
- Lead when you're the expert, follow when someone else is, challenging assumptions when necessary.
- Use AI as a daily force multiplier across coding, schema design, debugging, optimisation & validation.
- Destroy your colleagues at Geoguessr (optional but strongly encouraged).
What You'll Need
Technical Skills
- Graph Databases: Neo4j, ArangoDB, Neptune; schema design, relationship modelling, query optimisation.
- Python Data Engineering: ETL development; pandas/polars; distributed processing with Spark or Dask.
- Entity Resolution: Deduplication, merging, enrichment across heterogeneous scientific data sources.
- AI-Assisted Data Extraction: LLM entity extraction, schema generation & quality validation.
- Vector Search: Experience with Pinecone, FAISS, Qdrant, or Weaviate; embeddings, hybrid retrieval.
- Workflow Orchestration: Robust, observable pipelines using Airflow or Dagster.
- Data Formats & Standards: Parquet, JSONL, RDF/Turtle; selecting formats for graph & semantic use cases.
- Embedding Models: Understanding of HuggingFace/OpenAI models, dimensionality tradeoffs & cost.
Executive Skills
- Ownership mindset: Treat data & schemas as products powering multiple domains.
- Strategic evaluation: Choose tech aligned with our scale, latency expectations, & roadmap needs.
- Process engineering: Build reliable, repeatable & maintainable workflows.
- Cross-functional communication: Bridge product engineers & scientific domain teams.
- Comfort with scientific data realities: Deep rabbit holes of sprawling complexity.
Strong Bonus
- Life Sciences familiarity: Publication, clinical trial & institutional data; ontologies (MeSH, SNOMED, Gene Ontology).
- Hands-on with scientific datasets: OpenAlex, PubMed/MEDLINE, ORCID, Semantic Scholar, ClinicalTrials.gov.
Why You Might Hate It Here
- You want predictability & routine.
- You dislike documenting or sharing your thinking openly.
- You see AI as a threat rather than an amplifier.
- You're looking for a "safe" corporate environment - we're not that.
We mean this sincerely: if any of those points describe you, you'll be happier elsewhere.
Why You'll Love Working Here
- Real Autonomy: You'll own outcomes, not tickets. This is your domain - you'll define data strategy.
- Greenfield Opportunity: Build from scratch. Your decisions shape our data capabilities for years.
- Mission That Matters: Your work directly enables research - accelerating scientific breakthroughs.
- AI-First Culture: We use AI as a creative & operational partner across every function.
- High Impact: Every domain depends on what you build. Expert coverage directly drives our success.
Success Metrics (6-month target)
- Expert Coverage: Knowledge graph spans 1+ million experts with rich profile data & relationships.
- AI & Platform Enablement: AI & other domains are consuming knowledge graph insights.