What a Broken Benchmark Taught Me About Reproducible Experiments
Lessons Learned from Debugging Abnormal Power Results
2026-05-04