Michael J Clark (wassname)

I use the handle wassname. ML engineer in Perth. I work on AI alignment research, specifically steering language models without human preference labels.

I’m building tools to ask AI hard questions, and know if they’re lying. Also exploring unsupervised ways to make AI more moral than humans.

Open to collaboration, especially on AntiPaSTO.

current work

vGROUT: steering vectors for reward-hacking suppression (partial negative, Jun 2026)

Can we use a hacking vector to remove reward hacking with gradient routing? Somewhat. The label-free steering vectors were not precise enough as classifiers of hacky vs clean solutions in the realistic environment. The useful clue was initialization: signed-CorDA partially suppressed hacking by absorbing gradients into the hack-initialized quarantine adapter, dropping held-out hack from 0.759 to 0.218 in one 4B run. This is not a deployable operating point, but it is useful evidence because it uses synthetic pairs rather than labels, and strong labels may not be available for unknown reward hacks during frontier training.

Weak-to-strong character steering (WIP, with Lyptus)

Weight steering offers an interface where a weaker model can modify a larger model’s moral character by interviewing it and creating persona pairs. (I use weight steering because it beats activation steering by my measures.) It can be iterative, can hopefully allow a large gap between weak and strong, and might even scale favourably with model size. An early draft is public now: a 9B teacher steering a 27B student toward “defer less to authority, care more”, with no human labels.

weak to strong character steering early results

Released along the way: steering-lite (hackable, calibrated activation steering), lora-lite (single-file LoRA on forward hooks), steer-heal-love (KL-constrained repeated steering that stays coherent), tinymfv (fast logprob eval of moral preference change).

selected works

AntiPaSTO: Self-Supervised Steering of Moral Reasoning

arXiv:2601.07473, Jan 2026

Gradient-based representation steering using the model’s own behavioral consistency as signal. Outperforms prompting on out-of-distribution transfer. Builds on prior representation alignment work that showed promise but had stability issues.

bidirectional steering of moral preferences

S-space steering for eval-awareness control

AI Control Hackathon, Apart Research, judged Mar 2026

Replicated an eval-awareness paper, adding novel singular-value-basis (S-space) steering. Reduced the Hawthorne gap on Qwen3-32B to almost zero (1% vs prior work’s 26%).

more on github →

selected talks

Co-organizer of the Perth Machine Learning Group (3,400+ members).

selected writing

LessWrong — technical AI safety, policy

background

Kiwi from Christchurch, now in Perth. Physics BSc, MSc petroleum geoscience. Did oil & gas before switching to ML in 2016.

Day job: ML and modelling at Woodside Energy (I like scalable oversight, physics-informed neural networks, and time series, including neural processes). Also board member at Cytophenix (medical AI for AMR) and partner at Three Springs Technology (ML consulting).


I want to optimize for the good ending, not the bad one.