Michael J Clark (wassname)

I use the handle wassname. ML engineer in Perth. I work on AI alignment research, specifically steering language models without human preference labels.

I’m building tools to ask AI hard questions and know if they’re lying. I’m also exploring unsupervised ways to make AI more moral than humans.

Open to collaboration, especially on AntiPaSTO.

current work

vGROUT: removing reward hacking with gradient routing (WIP)

Can we use a hacking vector to remove reward hacking with gradient routing? We build the hacking vector from synthetic hack/honest pairs (the GRPO gradient update for the LoRA weights), then compare each training sample’s gradient with it: gradients with high cosine similarity get routed to a quarantine adapter, and the vast majority of in-between gradients get sorted out by absorption. Preliminary result (still improving robustness): the vectors remove reward hacking much better than vanilla GRPO but reduce task solving a bit. This is interesting because it uses synthetic pairs rather than labels and relies on internal representations, which could scale well with model capability.
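A minimal sketch of the routing decision described above, not the actual vGROUT code. It assumes the hacking vector is the mean difference between hack and honest gradients, and it uses a hypothetical fixed threshold; the names `mean_difference`, `route`, and `threshold` are illustrative only. Real gradients would be flattened LoRA updates, and the absorption step for in-between gradients is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_difference(hack_grads, honest_grads):
    """Assumed construction: mean hack gradient minus mean honest gradient."""
    dim = len(hack_grads[0])

    def mean(grads):
        return [sum(g[i] for g in grads) / len(grads) for i in range(dim)]

    return [h - n for h, n in zip(mean(hack_grads), mean(honest_grads))]


def route(sample_grad, hack_vec, threshold=0.5):
    """Send gradients aligned with the hacking vector to the quarantine adapter.

    The threshold is a hypothetical value chosen for illustration.
    """
    if cosine(sample_grad, hack_vec) >= threshold:
        return "quarantine"
    return "main"


# Toy 3-dimensional "gradients" standing in for flattened LoRA updates.
hack_grads = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0]]
honest_grads = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
hack_vec = mean_difference(hack_grads, honest_grads)

print(route([1.0, -0.5, 0.0], hack_vec))  # aligned with the hacking direction
print(route([0.0, 1.0, 0.0], hack_vec))   # points away from it
```

In this toy setup the first sample lands in the quarantine adapter and the second in the main adapter; everything below the threshold goes to the main adapter, which is a simplification of the routing described above.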

Weak-to-strong character steering (WIP, with Lyptus)

Weight steering offers an interface where a weaker model can modify a larger model’s moral character by interviewing it and creating persona pairs. I use weight steering because it beats activation steering by my measures. It can be iterative, can hopefully allow a large gap between weak and strong, and might even scale favourably with model size. It’s a work in progress, and it’s hard to get it working reliably with small models.

weak to strong character steering early results

Released along the way: steering-lite (hackable, calibrated activation steering), lora-lite (single-file LoRA on forward hooks), steer-heal-love (KL-constrained repeated steering that stays coherent), tinymfv (fast logprob eval of moral preference change).

selected works

AntiPaSTO: Self-Supervised Steering of Moral Reasoning

arXiv:2601.07473, Jan 2026

Gradient-based representation steering using the model’s own behavioral consistency as signal. Outperforms prompting on out-of-distribution transfer. Builds on prior representation alignment work that showed promise but had stability issues.

bidirectional steering of moral preferences

S-space steering for eval-awareness control

AI Control Hackathon, Apart Research, judged Mar 2026

Replicated an eval-awareness paper using novel singular-value-basis (S-space) steering. The Hawthorne gap on Qwen3-32B was reduced to almost zero (1% vs prior work’s 26%).

more on github →

selected talks

Co-organizer of the Perth Machine Learning Group (3,400+ members).

selected writing

LessWrong — technical AI safety, policy

background

Kiwi from Christchurch, now in Perth. Physics BSc, MSc in petroleum geoscience. I worked in oil & gas before switching to ML in 2016.

Day job: ML and modelling at Woodside Energy (I like scalable oversight, physics-informed neural networks, and time series, including neural processes). Also a board member at Cytophenix (medical AI for AMR) and a partner at Three Springs Technology (ML consulting).


I want to optimize for the good ending, not the bad one.