Understanding the BLEU and ROUGE Metrics for Large Language Models in 10 min using ChatGPT with Plugins

Hair Parra
7 min read · Aug 5, 2023

By using GPT-4, ScholarAI, and SmartSlides

Generated Using Bing Image Creator

In the world of Natural Language Processing (NLP), evaluating the performance of language models is crucial. Two popular metrics used for this purpose are BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). These metrics are widely used in tasks such as machine translation and text summarization. Most recently, these metrics have been heavily used in evaluating Large Language Models.
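To make the two definitions concrete before going further, here is a minimal pure-Python sketch of the core ideas: BLEU is built on clipped n-gram precision (how much of the candidate appears in the reference) with a brevity penalty, while ROUGE-N is built on n-gram recall (how much of the reference appears in the candidate). The function names are my own illustrations, and the sketch assumes a single reference, whitespace tokenization, and no smoothing, so its numbers may not match those of established libraries.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU: geometric mean of the clipped
    precisions for n = 1..max_n, times a brevity penalty. No smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

print(round(bleu(candidate, reference, max_n=2), 3))       # 0.707
print(round(rouge_n_recall(candidate, reference, n=1), 3))  # 0.833
```

In this toy pair, five of the six candidate unigrams and three of the five candidate bigrams appear in the reference, which gives a BLEU-2 of about 0.707. Five of the six reference unigrams appear in the candidate, which gives a ROUGE-1 recall of about 0.833.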

In this article, I show how I used GPT-4-powered ChatGPT and prompt engineering to quickly find relevant research papers on these metrics and summarize them concisely, with the mathematics rendered in LaTeX. I also produced runnable Python code examples using Hugging Face datasets on Google Colab, and even created a slide presentation, all in under 10 minutes. Used properly, Large Language Models have an enormous range of applications!


  • ChatGPT Extensions
  • Prompt Engineering
  • Retrieving Scholar Papers related to BLEU and ROUGE metrics
  • Understanding the Mathematics of the BLEU and ROUGE metrics
  • Generating Python Examples of BLEU and ROUGE
  • Generating a PowerPoint using SmartSlides

ChatGPT Extensions

The ChatGPT Plus subscription enables plugins, which extend the underlying model with specific functionality that would not be possible through the LLM alone. In this case, I used two plugins: ScholarAI and SmartSlides. ScholarAI gives ChatGPT access to a large database of academic articles and research papers across thousands of topics, while SmartSlides uses the context of the conversation to automatically create slides for you. Although I do not dive into how these plugins work in this article, I strongly encourage you to check them out!

Prompt Engineering

Prompt engineering is an active field of research which consists of improving Large Language Models’ (LLMs)…


