As a university educator and researcher in biomedical sciences, I’ve watched with equal parts curiosity and concern as generative AI tools like ChatGPT entered the academic arena. Could these tools support learning, or were they more likely to undermine academic integrity?
The AI Tipping Point in Higher Education
A recent study I conducted aimed to answer this pressing question – is generative AI any good at writing student-level essays? Previous studies, using early models of ChatGPT, had already claimed remarkable success at university-level examinations, notably the National Board of Medical Examiners (NBME) exam and the Uniform Bar Examination.
Rapid developments and improvement in generative AI models have only increased the concern among educators about the possibilities of cheating, plagiarism and the degradation of academic integrity. Concerns are also rife about the decline in student cognitive development, critical thinking and problem-solving skills due to AI over-use.

The Goal: Can AI Meet Academic Standards?
I tested three widely accessible generative AI tools – ChatGPT 3.5, Google Bard, and Microsoft Bing – by tasking each with writing essays typical of Levels 4 to 7 in the UK’s Framework for Higher Education Qualifications (FHEQ). These included first-year undergraduate through to MSc-level essay prompts from real modules at University College London (UCL).
Each AI’s output was assessed for scientific accuracy, mechanistic detail, context, coherence, and fidelity to the task. I evaluated the essays against FHEQ descriptors and provided qualitative feedback, just as I would for student submissions. The essays were also blind-marked by independent markers.
What I Found: AI Is Capable, but Not Competent
1. ChatGPT Came Out on Top—But Fell Short at Higher Levels
ChatGPT produced the most coherent, well-written essays, especially at Level 4. But as the complexity of the task increased, its limitations became more apparent. It struggled with mechanistic depth, failed to include specific examples and couldn’t demonstrate the critical thinking or originality expected at Levels 6 and 7.
2. Bard and Bing Lagged Behind
Bard’s outputs were often vague and repetitive, while Bing, despite sounding plausible, lacked logical structure and deeper insight. Alarmingly, Bing fabricated references in an academic style, a clear threat to academic integrity.
3. Complexity Exposed AI Weaknesses
All tools performed best at Level 4 and declined as the required depth and criticality increased. None met the highest academic expectations at Level 7, although all did a reasonably good job. This suggests that AI can replicate surface-level understanding but not the analytical rigour we expect of advanced learners.
4. Referencing Remains a Critical Failure Point
When asked to provide references for a Level 7 essay, none of the AI tools offered acceptable academic citations. Bard defaulted to general web sources, ChatGPT demurred entirely, and Bing produced fictitious but realistic-looking references, raising serious ethical and practical concerns.
Why This Matters: Implications for Higher Education
These findings raise important questions for how we design, deliver, and assess academic work in the age of AI. While some view these tools as productivity aids, my research suggests they currently pose more risk than reward when used to complete summative assessments.
- Assessment Must Evolve
Traditional take-home essays, especially at introductory levels, are now vulnerable to AI-generated text. We must diversify assessment formats to include oral exams, in-class writing, critical reflections, and problem-based learning tasks that AI can’t easily replicate, or integrate AI fully into the design of coursework.
- Critical Thinking is the New Frontier
At higher levels, AI falls short in synthesis, critical analysis and evidence-based argumentation. These gaps should inform how we scaffold our curricula, emphasising reasoning, argument construction and engagement with primary literature.
- Academic Integrity Needs Updating
We need to revise our academic conduct policies to explicitly address generative AI. Detection tools alone aren’t enough, and most fall short functionally. We must focus on education, transparency and appropriate AI use guidelines for students and staff alike.
- Digital Literacy is Now AI Literacy
Students should be trained not only to avoid academic misconduct but to understand when, how and why to use AI tools appropriately. These skills are relevant beyond university and increasingly desired in the workplace. Universities therefore have an obligation to train students in the proper and ethical use of generative AI.

My Recommendations: A Roadmap for Educators
- Redesign Assessments for Authenticity
Move toward formats that require personal engagement, critical thought and context-specific application.
- Provide Clear AI Usage Guidelines
We can’t ignore AI, but we can guide its use. Help students understand what constitutes acceptable use in your module and tie it to institutional values.
- Teach the Capabilities and Limitations of AI
Students need to know that AI can’t access peer-reviewed databases, can hallucinate facts and doesn’t think. It’s a tool, not a scholar.
- Embrace AI in Formative Learning
Used ethically, generative AI can support learning: drafting outlines, clarifying ideas or exploring alternative explanations. Let’s teach students to use it as a partner, not a shortcut.
AI Can Help—but It Can’t Replace Thoughtful Human Work
My study shows that while generative AI tools like ChatGPT are getting better at mimicking student writing, they still fall short of delivering the academic rigour, critical insight and scientific nuance required in higher education, especially in the biomedical sciences.
These findings underscore the urgent need for thoughtful assessment design, robust academic integrity policies and AI literacy in our curricula. The question isn’t whether AI belongs in higher education; it’s how we adapt to ensure that human learning and critical thinking remain at the heart of it.
References:
Williams, A. (2024). Comparison of generative AI performance on undergraduate and postgraduate written assessments in the biomedical sciences. International Journal of Educational Technology in Higher Education, 21, Article 52. https://doi.org/10.1186/s41239-024-00485-y