Michael J Clark (wassname)

I use the handle wassname. ML engineer in Perth. I work on AI alignment research, specifically steering language models without human preference labels.
I’m building tools to ask AI hard questions and know whether it’s lying. I’m also exploring unsupervised ways to make AI more moral than humans.
Open to collaboration, especially on AntiPaSTO.
current work
vGROUT: removing reward hacking with gradient routing (WIP)
Can we use a hacking vector to remove reward hacking with gradient routing? We build the hacking vector from synthetic hack/honest pairs (the GRPO gradient update for the LoRA weights), then compare each training sample’s gradient with it: gradients with high cosine similarity get routed to a quarantine adapter, and the vast majority of in-between gradients get sorted out by absorption. Preliminary result (still improving robustness): the vectors remove reward hacking much better than vanilla GRPO but slightly reduce task solving. This is interesting because it uses synthetic pairs rather than labels and relies on internal representations, which could scale well with model capability.
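A rough illustration of the routing step, not the actual vGROUT code. It assumes the hacking vector is the normalised difference between mean hack and mean honest gradients, and it uses a hypothetical fixed `threshold`. The function names and the toy gradients are made up for the example, and the absorption step is not modelled.

```python
import numpy as np

def hacking_vector(hack_grads, honest_grads):
    # Assumption: the vector is the mean hack gradient minus the mean
    # honest gradient, scaled to unit length.
    v = hack_grads.mean(axis=0) - honest_grads.mean(axis=0)
    return v / np.linalg.norm(v)

def route(sample_grads, v, threshold=0.5):
    # Cosine similarity of each flattened per-sample gradient with v.
    norms = np.linalg.norm(sample_grads, axis=1) + 1e-12
    sims = (sample_grads @ v) / norms
    # True = send this sample's update to the quarantine adapter.
    return sims > threshold, sims

# Toy data: "hack" gradients share a common direction, "honest" ones do not.
rng = np.random.default_rng(0)
dim = 64
direction = np.zeros(dim)
direction[0] = 1.0
hack = rng.normal(0.0, 0.05, (8, dim)) + direction
honest = rng.normal(0.0, 0.05, (8, dim))

v = hacking_vector(hack, honest)

other = np.zeros(dim)
other[1] = 1.0
batch = np.stack([2.0 * direction, other])  # one hack-like, one unrelated
mask, sims = route(batch, v)
print(mask, sims.round(3))
```

In a real setup the per-sample gradients would come from the LoRA parameters, and the threshold would need calibrating rather than being fixed by hand.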
Weak-to-strong character steering (WIP, with Lyptus)
Weight steering offers an interface where a weaker model can modify a larger model’s moral character by interviewing it and creating persona pairs (I use weight steering because it beats activation steering by my measures). It can be iterative, can hopefully allow a large gap between weak and strong, and might even scale favourably with model size. It’s a work in progress, and it is hard to get working reliably with small models.

Released along the way: steering-lite (hackable, calibrated activation steering), lora-lite (single-file LoRA on forward hooks), steer-heal-love (KL-constrained repeated steering that stays coherent), tinymfv (fast logprob eval of moral preference change).
selected works
AntiPaSTO: Self-Supervised Steering of Moral Reasoning
arXiv:2601.07473, Jan 2026
Gradient-based representation steering using the model’s own behavioral consistency as signal. Outperforms prompting on out-of-distribution transfer. Builds on prior representation alignment work that showed promise but had stability issues.
S-space steering for eval-awareness control
AI Control Hackathon, Apart Research, judged Mar 2026
Replicated an eval-awareness paper with novel singular-value-basis (S-space) steering. The Hawthorne gap on Qwen3-32B was reduced to almost zero (1% vs prior work’s 26%).
selected talks
Perth Machine Learning Group (3,400+ members) co-organizer. Selected talks:
- Jan 2026 — AntiPaSTO: Self-Supervised Value Steering — Interpretability research
- May 2023 — AI Governance: Risk and Regulation — Panel at WA Data Science Week
- Aug 2019 — Experiments with GPT-2 Chatbots — Early LLM exploration
- Jun 2019 — Transformer Network Architecture — Attention mechanisms, BERT/GPT
- 2018-2021 — Industrial RL (bucketwheel reclaimers, robotic fruit picking), point clouds, neural processes
selected writing
LessWrong — technical AI safety, policy
- An Aphoristic Overview of Technical AI Alignment — one-sentence guide to alignment ideas
- Private Capabilities, Public Alignment — why we should open-source alignment methods
- More
background
Kiwi from Christchurch, now in Perth. Physics BSc, MSc petroleum geoscience. Did oil & gas before switching to ML in 2016.
Day job: ML and modelling at Woodside Energy (I like scalable oversight, physics-informed neural networks, and time series, including neural processes). I’m also a board member at Cytophenix (medical AI for AMR) and a partner at Three Springs Technology (ML consulting).
I want to optimize for the good ending, not the bad one.
