Publishing My First Research Paper — What We Learned About Near-Peer Teaching

16 Jun 2026 in General

Reflecting on publishing my first peer-reviewed paper: what the data actually says about whether near-peer instruction helps students, and why the honest answer is more interesting than it first appears.

“A causal question answered with the wrong method isn’t just uninformative — it can lead you to abandon something that works.”

How This Started
The Problem With “Did It Help?”
What We Actually Did
Why Structure Might Actually Matter
What It Feels Like to Publish
The Bigger Takeaway for Program Evaluation
Read the Full Paper
Definitions

How This Started

When I took on the TA and tutor coordinator role for Human Physiology at BU, I wasn’t expecting to end up writing a research paper about it. I was mostly focused on making the review sessions useful and keeping the tutoring program from falling apart logistically — 50 tutors, 100+ tutees, a lot of scheduling headaches.

But partway through the semester, Dr. Schonhoff and I started looking at the data differently. We had exam scores for everyone, session attendance logs, tutoring records. And when we ran some early numbers, the results looked bad — students who attended TA sessions or worked with tutors were scoring lower than students who didn’t seek any support at all.

That felt wrong. Not just intuitively, but logically. Were we actually making things worse?

The Problem With “Did It Help?”

The short answer to why that initial result was misleading: students who seek out tutoring are not a random sample of the class.

Think about who comes to extra review sessions or reaches out for one-on-one tutoring. It’s rarely the students who are cruising. It’s the ones who are struggling, who scored poorly on the first quiz, who are worried. So when you compare their exam scores to students who didn’t seek help, you’re not comparing equivalent groups — you’re comparing people who started from different places.

In epidemiology, this is called confounding by indication. The very thing that “indicates” the treatment (in this case, academic struggle) also affects the outcome (exam scores). Programs designed to help at-risk students will almost always look ineffective — or even harmful — in naive comparisons, because the students using them were already behind.

This isn’t a new problem in education research. But it’s one that gets glossed over surprisingly often when programs get evaluated.
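
A small simulation makes the mechanism concrete. This is a sketch on made-up data, not the analysis from the paper: the function names, the score scale, the 70-point cutoff, the help-seeking probabilities, and the +2 point true effect are all assumptions chosen for illustration. Help is built to be beneficial, yet the naive comparison still comes out negative because help-seekers start from lower baselines.

```python
import random
from statistics import mean


def simulate_class(n_students=5000, true_effect=2.0, seed=0):
    """Simulate one exam where struggling students are more likely to seek help.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_students):
        baseline = rng.gauss(75, 10)  # prior exam score
        # The "indication": a low baseline makes help-seeking more likely.
        p_help = 0.8 if baseline < 70 else 0.2
        helped = rng.random() < p_help
        # The outcome depends on the baseline plus a genuinely positive effect.
        score = baseline + (true_effect if helped else 0.0) + rng.gauss(0, 5)
        rows.append((baseline, helped, score))
    return rows


def naive_difference(rows):
    """Mean score of help-seekers minus mean score of everyone else."""
    helped = [score for _, h, score in rows if h]
    others = [score for _, h, score in rows if not h]
    return mean(helped) - mean(others)


rows = simulate_class()
print(f"true effect: +2.00, naive difference: {naive_difference(rows):+.2f}")
```

The sign of the naive difference is driven by who seeks help, not by what the help does, which is the same pattern our early numbers showed.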

What We Actually Did

Rather than stop at a simple comparison, we built out a four-model causal framework, each one representing a more rigorous attempt to make fair comparisons between students who sought support and those who didn’t:

1. The naive view: just compare scores, no adjustment. This is where the apparent “harm” lives.
2. Control for where students started: add their previous exam score as a covariate. This asks: among students at the same baseline, was getting help associated with better outcomes?
3. Statistically rebalance the groups: use inverse probability weighting to make the tutored and non-tutored groups look more similar on observed characteristics before comparing them.
4. Use each student as their own control: the most stringent approach. Instead of comparing different students, compare the same student in blocks where they did vs. didn’t seek support. This removes all the stable unmeasured differences between people (test anxiety, help-seeking personality, background) because those traits are constant across blocks.
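
The "each student as their own control" idea can be sketched as a within-student (fixed effects) estimator: subtract each student's own average from their scores and from their help indicator, then fit a slope on what is left. This is a minimal illustration on simulated data, not the specification used in the paper. The names `simulate_panel`, `pooled_difference`, and `within_student_estimate`, the four blocks per student, and the +1 point true effect are all assumptions.

```python
import random
from collections import defaultdict
from statistics import mean


def simulate_panel(n_students=2000, n_blocks=4, true_effect=1.0, seed=1):
    """Hypothetical panel: one row per student per exam block."""
    rng = random.Random(seed)
    rows = []
    for student_id in range(n_students):
        ability = rng.gauss(0, 8)  # stable, unmeasured trait
        # Weaker students seek help more often in every block.
        p_help = 0.7 if ability < 0 else 0.3
        for _ in range(n_blocks):
            helped = 1.0 if rng.random() < p_help else 0.0
            score = 75 + ability + true_effect * helped + rng.gauss(0, 4)
            rows.append((student_id, helped, score))
    return rows


def pooled_difference(rows):
    """Naive comparison that ignores which student each row belongs to."""
    helped = [score for _, h, score in rows if h == 1.0]
    others = [score for _, h, score in rows if h == 0.0]
    return mean(helped) - mean(others)


def within_student_estimate(rows):
    """Slope of score on help after removing each student's own averages."""
    by_student = defaultdict(list)
    for student_id, helped, score in rows:
        by_student[student_id].append((helped, score))
    numerator = 0.0
    denominator = 0.0
    for observations in by_student.values():
        help_mean = mean([h for h, _ in observations])
        score_mean = mean([s for _, s in observations])
        for h, s in observations:
            # Students whose help status never changes contribute nothing.
            numerator += (h - help_mean) * (s - score_mean)
            denominator += (h - help_mean) ** 2
    return numerator / denominator


rows = simulate_panel()
print(f"pooled difference:       {pooled_difference(rows):+.2f}")
print(f"within-student estimate: {within_student_estimate(rows):+.2f}")
```

Because `ability` is constant within a student, it cancels when the student's averages are subtracted. Anything that changes from block to block would not cancel, which is the main limit of this approach.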

What happened as we moved down that list was exactly what you’d expect if the apparent harm was a statistical artifact: the negative tutoring association shrank with every step, eventually reaching a near-zero, statistically non-significant estimate under the most rigorous model.

The structured TA sessions told the opposite story. Their estimate moved consistently in a positive direction as controls were added, from −0.92 percentage points in the naive model to +0.68 pp in the fixed effects model. The final estimate is not statistically significant, but the shift was in the same direction at every step. After removing all stable student-level confounders, TA sessions and tutoring pointed in opposite directions.

Why Structure Might Actually Matter

Here’s the part that I find most interesting from a practical standpoint: the two components of the program were not the same thing.

TA sessions were structured. Each session followed a 30/30/30 format — content review, practice problems, Socratic discussion. Content was faculty-reviewed before delivery. TAs were selected through a formal process and trained. Sessions happened on a predictable weekly schedule.

Peer tutoring was informal. Topics were driven by whatever the student felt stuck on that day. There was no standardized format, no faculty review, and (honestly) significant variation in how good individual tutors were at explaining things. The matching process was weighted toward scheduling availability.

The cognitive science of learning has pretty strong things to say about what makes studying effective: spaced practice, retrieval, worked examples with feedback, interleaving. Structured TA sessions build those in by design. Ad hoc tutoring sessions generally don’t.

The 2.01 percentage point divergence between the two modalities under our strictest model is consistent with this structural explanation — though we’re clear in the paper that a larger prospective study would be needed to characterize the effect with real precision.

What It Feels Like to Publish

I won’t pretend the process was glamorous. A lot of what goes into a paper is decisions about how to handle edge cases in the data, conversations about whether a particular model specification is defensible, and rounds of revisions.

But there’s something genuinely satisfying about taking a question that felt intuitive (“does this program help?”) and working through it carefully enough to give a more honest answer. The result isn’t “yes” or “no” — it’s “the apparent harm is an artifact, and the structural features of the program may matter more than raw participation.” That’s a more useful finding for anyone trying to run a program like this.

Dr. Schonhoff was a great collaborator throughout — pushing me to think more carefully about the causal assumptions at every step and making sure the story we told in the paper matched what the data actually supported.

The Bigger Takeaway for Program Evaluation

If there’s one thing I’d want someone evaluating a voluntary tutoring program to take from this: don’t interpret a naive negative association as evidence that your program is harmful. Selection bias runs in a very predictable direction here. Students who are struggling seek help. Their outcomes, compared to students who weren’t struggling enough to seek help, will look worse regardless of what the program does for them.

The more interesting question is whether, among students who were at similar starting points, the ones who got help did better or worse than those who didn’t. And even better than that: whether the same student performed differently in blocks where they participated versus blocks where they didn’t.

Those comparisons are harder to make — but they’re the ones that actually answer the question.
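
The first of those comparisons, among students at similar starting points, is what adding the previous exam score as a covariate does. Below is a rough sketch on simulated data using plain least squares, not the model from the paper. The helper names (`fit_line`, `residuals`, `baseline_adjusted_effect`), the help-seeking curve, and the +2 point true effect are assumptions. The coefficient on help is computed by residualizing both score and help on the baseline (the Frisch-Waugh-Lovell approach), which gives the same number as a regression of score on baseline and help together.

```python
import math
import random
from statistics import mean


def simulate_scores(n_students=5000, true_effect=2.0, seed=2):
    """Hypothetical data: lower baselines make help-seeking more likely."""
    rng = random.Random(seed)
    baseline, helped, score = [], [], []
    for _ in range(n_students):
        b = rng.gauss(75, 10)
        p_help = 1 / (1 + math.exp((b - 72) / 5))  # smooth drop with baseline
        h = 1.0 if rng.random() < p_help else 0.0
        baseline.append(b)
        helped.append(h)
        score.append(b + true_effect * h + rng.gauss(0, 5))
    return baseline, helped, score


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_mean, y_mean = mean(x), mean(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def baseline_adjusted_effect(baseline, helped, score):
    """Coefficient on `helped` in the regression score ~ baseline + helped."""
    score_resid = residuals(baseline, score)
    helped_resid = residuals(baseline, helped)
    sxy = sum(h * s for h, s in zip(helped_resid, score_resid))
    sxx = sum(h * h for h in helped_resid)
    return sxy / sxx


baseline, helped, score = simulate_scores()
naive = (mean([s for s, h in zip(score, helped) if h == 1.0])
         - mean([s for s, h in zip(score, helped) if h == 0.0]))
adjusted = baseline_adjusted_effect(baseline, helped, score)
print(f"naive: {naive:+.2f}, baseline-adjusted: {adjusted:+.2f}")
```

This works in the simulation because the baseline is the only confounder and it enters linearly. With real students, unmeasured differences remain, which is why the within-student comparison is the stricter test.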

Read the Full Paper

The paper is published in Physiology (American Physiological Society) and is available through the link on my Research page.

Definitions

Near-Peer Instruction (NPI): Teaching and academic support provided by students who recently completed the same course material — close enough to relate to the difficulties, far enough ahead to help navigate them.
Confounding by Indication: A form of confounding (often loosely described as selection bias) in which the factor that leads someone to seek a treatment (e.g., struggling academically) also independently affects the outcome being measured (exam performance), which can make the treatment appear ineffective or harmful in unadjusted analyses.
Fixed Effects Model: A statistical approach that uses each subject as their own control, removing the influence of all characteristics that are stable over time and may confound comparisons between groups. It does not remove confounders that change over time.
IPTW (Inverse Probability of Treatment Weighting): A statistical method, widely used in epidemiology, that weights each observation by the inverse of its estimated probability of receiving the treatment it actually received, so that treatment and control groups become more comparable on observed covariates before effects are estimated.
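
To make the IPTW definition concrete, here is a minimal sketch with a single binary covariate, so the propensity score is just the share of tutored students within each group. It is an illustration on simulated data, not the paper's model: the names `simulate_tutoring`, `naive_gap`, and `iptw_effect`, the `struggling` flag, and every number are assumptions.

```python
import random
from collections import Counter
from statistics import mean


def simulate_tutoring(n_students=6000, true_effect=2.0, seed=3):
    """Hypothetical data: struggling students are tutored more often."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_students):
        struggling = rng.random() < 0.4
        tutored = rng.random() < (0.7 if struggling else 0.2)
        base = 65.0 if struggling else 82.0
        score = base + (true_effect if tutored else 0.0) + rng.gauss(0, 6)
        rows.append((struggling, tutored, score))
    return rows


def naive_gap(rows):
    """Unweighted mean difference, tutored minus untutored."""
    tutored = [score for _, t, score in rows if t]
    others = [score for _, t, score in rows if not t]
    return mean(tutored) - mean(others)


def iptw_effect(rows):
    """Weighted mean difference using inverse probability of treatment weights."""
    # Estimated propensity: fraction tutored within each covariate group.
    group_size = Counter(s for s, _, _ in rows)
    group_tutored = Counter(s for s, t, _ in rows if t)
    propensity = {s: group_tutored[s] / group_size[s] for s in group_size}

    totals = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [sum of w*score, sum of w]
    for struggling, tutored, score in rows:
        p = propensity[struggling]
        # Assumes 0 < p < 1 in every group; otherwise the weight is undefined.
        weight = 1 / p if tutored else 1 / (1 - p)
        totals[tutored][0] += weight * score
        totals[tutored][1] += weight
    return (totals[True][0] / totals[True][1]
            - totals[False][0] / totals[False][1])


rows = simulate_tutoring()
print(f"naive: {naive_gap(rows):+.2f}, IPTW: {iptw_effect(rows):+.2f}")
```

The weights make each arm resemble the whole class on `struggling`. Like any weighting on observed covariates, this cannot correct for differences that were never measured.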

Joe Webb
