
Rethinking assessment in the age of AI



Insights from a recent Digital Learning Institute (DLI) webinar with Cohen Ambrose

In a recent Digital Learning Institute webinar, Cohen Ambrose, Course Director at DLI, explored a question that is becoming harder for educators, trainers, and learning teams to ignore:

If AI can help people produce polished work and even complete complex tasks, what still counts as evidence of learning?

Rather than treating this only as an academic integrity issue, Cohen argued for something deeper: a rethink of how we design assessment itself.

The session focused on the growing gap between “produced something” and “learned something,” especially now that generative AI and agentic AI tools can support not just writing, but planning, coding, analysis, drafting, and multi-step task completion. The challenge is no longer simply detecting AI use. It is figuring out how to identify durable human capability when performance can be heavily AI-assisted.

What followed was a practical and thought-provoking exploration of why traditional proxies for learning are breaking down, why transfer across contexts matters more than ever, and how educators can begin redesigning assessment to collect better evidence of understanding.

This blog captures the core ideas from the session, rewritten as a guide you can use.

Watch the full webinar on demand

Catch the full session with Cohen Ambrose, plus Q&A highlights.

Download the recording

A polished output is no longer reliable evidence of learning

One of the clearest messages from the webinar was that a high-quality artifact can no longer be treated as dependable proof that learning has taken place.

For a long time, educators and learning professionals could reasonably use a submitted essay, report, project, presentation, or other piece of assessed work as a proxy for what someone had understood. If a learner produced something strong, we often assumed they had internalized the relevant knowledge and skills.

Cohen’s argument was that this assumption no longer holds.

In an AI-rich environment, someone can now produce a sophisticated output, or even perform a sequence of tasks successfully, without necessarily developing the underlying capability we hope they will retain. In his framing, successful performance is no longer sufficient evidence of durable human capability.

This distinction matters in both formal education and workplace learning. A student may submit an excellent assignment. A colleague may complete a task effectively with AI support. But in either case, the visible outcome alone does not tell us whether the person can explain, adapt, justify, or transfer that performance without relying on the tool in the same way next time.

That changes the role of assessment. The goal cannot simply be to judge the final product. It must be to gather evidence that learning has actually happened.

The real issue is the gap between “produced something” and “learned something”

A useful framing from the webinar was the need to close the gap between “produced something” and “learned something.”

That gap has always existed to some extent. Learners could memorize, mimic, or perform in familiar conditions without truly understanding. But generative AI and agentic tools dramatically widen that gap because they make it easier to generate fluent, convincing, and well-structured outputs at speed.

This means that traditional assessment designs may now be collecting evidence of the wrong thing.

They may show that a learner can coordinate tools, respond strategically to a brief, or produce something polished. But that is not the same as demonstrating understanding. Cohen’s argument was that we cannot credibly claim someone understands a body of knowledge unless we have verifiable evidence that they can use that knowledge in more than one novel context.

That phrase matters.

It is not enough to perform once in the same context in which learning occurred. Real understanding becomes more visible when learners can adapt, respond, and apply what they know when the conditions change.

Why context matters more than many assessments acknowledge

A major theme in the webinar was that understanding cannot be separated from context.

If learners are only ever asked to perform in the same environment in which they were taught, then what we are often measuring is not durable capability but the ability to rehearse. They may be able to reproduce a method or respond correctly within familiar conditions, but that does not tell us whether they can do so elsewhere.

Cohen illustrated this through a simple analogy: children may practise a social skill effectively during classroom circle time, then fail to apply it on the playground because the context has changed. The knowledge did not transfer automatically.

The same applies in higher education, training, and workplace development. A learner who can respond well to a familiar case study, assessment prompt, or supported classroom task may still struggle in a different setting where the cues, constraints, or stakes have shifted.

This is why transfer across novel contexts becomes so important. It was always valuable evidence of learning. In the age of AI, it becomes even more critical.

Generative AI changed the task. Agentic AI changes the whole process

The webinar also distinguished between generative AI and agentic AI, and why that matters for assessment.

Generative AI is already familiar to most people: a prompt goes in, an output comes back. That output might be text, images, code, or ideas. While powerful, it still often requires the user to manage the larger task structure themselves.

Agentic AI changes the dynamic.

As Cohen explained, agentic systems can break down larger tasks into subtasks, recommend next steps, retrieve resources, coordinate actions, and operate with partial autonomy within user-defined constraints. In practice, that means a learner can now ask a tool to carry out large parts of an end-to-end process with relatively little intervention.

A student might upload a full project folder, including briefs, notes, sources, feedback, rubrics, and datasets, then ask an agentic system to produce a research plan, organize themes, draft a literature review, generate slides, write speaker notes, or even create a working prototype. The tool is no longer just helping with one output. It is helping orchestrate the whole assignment.

That creates a very different assessment challenge.

If AI can now support the full architecture of a task, then redesigning assessment around “process not just product” is necessary but no longer sufficient on its own. We need to think more carefully about what part of the process still shows human understanding, and under what conditions.

We need a learning-theory response, not just an integrity response

A particularly important point from the webinar was that the AI challenge cannot be addressed by integrity policies alone.

Cohen argued that while academic integrity conversations remain important, they are not enough. What is needed is a stronger learning-theoretic frame.

That means asking questions such as:

  • What does learning look like under conditions of AI support?

  • Which forms of delegation help learning, and which undermine it?

  • How do we know whether supported performance has become retained capability?

  • What kinds of assessment collect evidence of judgment, explanation, and transfer rather than just fluent completion?

This is where the webinar introduced an emerging idea: that the conditions of learning may have changed enough to require new conceptual tools.

A new learning theory for AI-rich environments?

Cohen referenced a recent preprint proposing a new learning theory called agentivism.

The argument behind this proposal is that new learning theories tend to emerge when the conditions of learning shift significantly, whether because of technological, social, economic, or epistemic change. Generative and agentic AI may represent exactly that kind of shift.

Under these conditions, the central question is no longer simply what the learner knows. Instead, it becomes something closer to this:

Can the learner later explain, justify, adapt, and transfer what they have done, with reduced dependence on the tool?

That is a subtle but significant move. It shifts the focus away from a completed output and toward the learner’s relationship with the reasoning behind it.

This is especially useful because it avoids a simplistic anti-AI position. The goal is not to ban support. It is to distinguish between supported performance that contributes to learning and supported performance that merely disguises its absence.

Four mechanisms that help explain learning with AI

One of the most useful parts of the session was Cohen’s walkthrough of four “core mechanisms of learning with AI.” These provide a practical lens for evaluating whether AI-assisted work is actually contributing to learning.

1. Delegated agency

This asks how responsibility for the task is distributed between the learner and the AI.

Delegation is not automatically a problem. In fact, it can be productive. Learners can preserve agency when they remain responsible for framing the problem, setting criteria, judging quality, and deciding what matters.

The risk appears when those functions gradually migrate to the tool. A learner may begin by critically shaping the process, then slowly shift into approving whatever the system suggests. At that point, the AI is not just assisting. It is starting to drive the thinking.

2. Epistemic monitoring and verification

This is about whether learners evaluate outputs for truthfulness, relevance, and adequacy, or whether they simply accept fluent language as a sign of correctness.

In practice, this means comparing sources, interrogating outputs, understanding that models are biased and incomplete, and actively checking what the tool produces. It is not enough to receive something plausible. Learners need to remain engaged in judging whether it is trustworthy and useful.

3. Reconstructive internalization

This may be the most important of the four.

The idea is that learning has not yet occurred unless the learner can reconstruct why the accepted AI output is appropriate, identify where it would fail, adapt it to a new situation, or reproduce the underlying reasoning with less assistance.

In other words, can they explain it in their own words? Can they move away from the generated output and still make sense of the logic behind it?

If not, then the output may have been accepted, but not learned.

4. Transfer under reduced support

This asks whether the learner can adapt and perform in a new context with less reliance on the same tools.

This is the strongest evidence that retained capability exists. If learners can only succeed when the exact same supports are present, then competence may reside in the human-AI configuration rather than in the human alone.

Together, these mechanisms offer a helpful warning: what looks like learning may sometimes be sophisticated configuration. A learner can orchestrate tools intelligently and still fail to build durable capability.

What this looks like in practice: two versions of the same learner

To make the issue concrete, Cohen told the story of a fictional student, Aoife Donnelly, a final-year bioinformatics student about to begin a workplace internship.

The power of the example came from its comparison of two paths. Aoife had access to the same tools in both paths. The difference was not the technology. It was the assessment design she experienced.

Path A: polished output, weak capability

In the first version, Aoife used AI tools to help her complete a final-year research project. Over time, that support expanded. The study design, code, statistics, drafting, and revisions were increasingly handled by the tools. She still read the final output carefully and could speak about it at a surface level, so she passed her viva and earned a strong result.

But when she entered the workplace, things changed.

She encountered a real clinical dataset with privacy restrictions. She could not use public AI tools in the same way. The internal chatbot available at work could only offer generic support. Suddenly, what mattered was not fluent output but understanding: how to preprocess data, read existing code, interpret methodological decisions, and act responsibly under real constraints.

Because her assessment had rewarded polished completion rather than durable capability, she struggled.

Path B: same result, very different readiness

In the second version, Aoife still used AI, but the assessment architecture was different. She had to submit process artifacts such as weekly logs, prompt decisions, code provenance, and justifications for her modelling choices. She used AI tutors that scaffolded reasoning rather than doing the work for her. She encountered simulation-based challenges that introduced new contexts and had to complete an oral defence with novel questions.

She still submitted a polished report. She still earned the same grade.

But the outcome was different because the assessment had required her to internalize the reasoning, explain decisions, and apply her knowledge under changed conditions. When she entered the internship, she could read code, justify choices, and use the company’s private AI tools appropriately.

The contrast was clear: the same tools can produce very different learning outcomes depending on the assessment regime around them.

Assessment should be designed as an architecture, not a single event

Another strong theme from the webinar was the idea that assessment should be treated less as an isolated submission point and more as an architecture across the whole learning experience.

This means moving away from a model where the final product carries most of the evidential weight. Instead, assessment becomes a structured sequence that brings together:

  • assessment as learning

  • assessment for learning

  • assessment of learning

Rather than existing separately, these modes work together throughout a module, project, or program.

Cohen suggested that scenario-based, story-based, simulation-based, and case-based approaches work particularly well here because they situate learning in something more dynamic than a static task. They also create opportunities to test what happens when conditions change.

That matters because a learner who can only perform under stable, familiar conditions may not yet have developed transferable capability.

The kinds of evidence worth collecting

The webinar proposed several forms of evidence that can help assessment focus more on process, reasoning, and transfer.

These included:

  • prompt trajectories

  • AI interaction sequences

  • revision histories

  • decision logs

  • reflections on why particular outputs were accepted or rejected

  • code provenance

  • oral or video explanations

  • responses to new constraints or altered scenarios

What makes these valuable is not just that they expose use of AI. It is that they can encourage learners to think metacognitively about their own process.

In this sense, collecting process evidence can itself become part of learning. It helps learners notice how they are using tools, where their judgment enters, and whether they are actually reconstructing understanding or simply moving outputs around.

AI tutors and simulations may become more important

Cohen also pointed to two design directions that may become increasingly useful.

The first is the use of custom AI tutors or assessment assistants that scaffold learner reasoning without completing the task for them. These tools can prompt, sequence, and challenge the learner, refusing to move on until certain ideas have been articulated or justified.

The second is the use of AI-enhanced simulations and scenario-based roleplays. These can place learners into conditions they have not previously encountered and require them to respond in real time to uncertainty, contradiction, or changing constraints.

At DLI, Cohen described experimentation with simulations and “uncertainty cards” that deliberately shift the conditions of a case. For example:

  • a key piece of evidence turns out to be flawed

  • a major project goal changes

  • a data source is outdated

  • assumptions that shaped the original response are no longer valid

  • the work now needs to function in another culture or environment

These kinds of disruptions do not just test recall. They test whether learners can adapt, reframe, and transfer their reasoning under pressure.

Why oral defence may return, even if it is hard to scale

The webinar also suggested that oral defence, in some form, may become increasingly difficult to avoid.

That does not mean every assessment must become a traditional viva. But it does point toward the value of formats where learners must explain, justify, and respond to something they have not pre-scripted.

This is one of the hardest things to scale, especially in large programs. But it may also be one of the most effective ways to distinguish between polished performance and internalized capability.

Cohen also noted that audio diaries, video reflections, and portfolio walkthroughs may offer more scalable alternatives in some contexts, especially if learners are asked to respond to novel prompts rather than simply narrate a pre-prepared account.

The design challenge is not just to collect more evidence. It is to collect evidence that reflects what the learner can still do when the support conditions change.

“Authentic assessment” may not be enough as a concept

In the discussion, Cohen also raised an interesting challenge to the language of “authentic assessment.”

His point was not that realism or relevance are unimportant. Rather, it was that “authentic” is often used too loosely, as if it automatically solves the problem.

A more useful approach may be to think in terms of assessment ecologies or assessment architectures: whole systems of tasks, supports, constraints, and responses that together provide richer evidence of capability over time.

That shift matters because the issue is not just whether an assessment resembles the real world. It is whether it shows how learners think, adapt, justify, and transfer under conditions that matter.

What to take forward

If there was one message at the heart of the webinar, it was this:

We need to stop assuming that polished output equals learning.

In an AI-rich environment, assessment has to become more intentional about what it is trying to capture. If we care about durable human capability, then we need designs that ask learners to explain their reasoning, stay accountable for decisions, respond to new contexts, and perform with reduced dependence on the tool.

That does not require rejecting AI. In many fields, AI-supported work is now part of normal practice. But it does require a sharper distinction between using AI as a partner in learning and outsourcing the very capability we claim to be developing.

Some useful questions to take back to your own context are:

  • Where are we currently treating polished output as evidence of understanding?

  • Which of our assessments collect product, but not process?

  • How often do we ask learners to explain, justify, or adapt what they have done?

  • Are we testing performance only in familiar conditions?

  • What would change if we designed for transfer across multiple novel contexts?

  • Which parts of our current assessments are really measuring configuration rather than capability?

  • Where could simulations, uncertainty, or oral explanation help us gather better evidence?

As Cohen’s session made clear, the challenge is not simply that AI can now produce more. It is that educators and learning teams must become much more precise about what counts as learning when tools can do so much of the visible work.

Assessment, in that world, cannot just verify completion. It has to reveal capability.

FAQs

What is the main assessment challenge in the age of AI?
The central challenge is that polished outputs and even successful task completion are no longer reliable evidence that a learner has developed genuine understanding or durable capability.

What is meant by durable human capability?
It refers to capability that remains with the person, not just with the tool-supported workflow. It includes being able to explain, justify, adapt, and transfer performance beyond the original task.

Why is process evidence more important now?
Because final products can be heavily AI-assisted, process evidence helps show how the learner thought, decided, revised, judged outputs, and developed understanding along the way.

What is agentic AI?
Agentic AI refers to systems that can carry out multi-step tasks with partial autonomy, such as planning, retrieving resources, coordinating subtasks, and producing outputs across an entire workflow.

How is agentic AI different from generative AI?
Generative AI usually responds to prompts with outputs such as text or images. Agentic AI can manage longer chains of action and take on larger parts of a complete task.

What are some better ways to assess learning now?
More effective approaches may include process logs, prompt histories, code provenance, oral defences, simulations, roleplays, uncertainty-based scenario shifts, and portfolio explanations.

Why does transfer matter so much?
Because being able to perform in one familiar setting does not prove understanding. Transfer across new contexts is stronger evidence that learning has been internalized.

Do we need to ban AI to protect learning?
No. The webinar’s argument was not anti-AI. It was that assessment should be designed to show whether learners are using AI in ways that still develop and reveal human capability.

What are uncertainty cards?
They are prompts or disruptions introduced into a task or scenario to shift the conditions and test how learners adapt, justify decisions, and apply knowledge in a new context.

What should educators redesign first?
A good place to start is any assessment that currently depends heavily on a polished final artifact and offers little visibility into the learner’s reasoning, process, or ability to respond to change.
