Understanding the BLEU and ROUGE Metrics for Large Language Models in 10 min using ChatGPT with Plugins

Hair Parra
7 min read · Aug 5, 2023

By using GPT-4, ScholarAI, and SmartSlides

Image generated using Bing Image Creator

In Natural Language Processing (NLP), evaluating the performance of language models is crucial. Two popular metrics for this purpose are BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). Both are widely used in tasks such as machine translation and text summarization. More recently, they have also been used heavily to evaluate Large Language Models (LLMs).
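As a rough illustration of what these metrics measure: BLEU is built on n-gram precision (how much of the generated text appears in the reference), while ROUGE, as its name says, is recall-oriented (how much of the reference appears in the generated text). The sketch below computes only the unigram version of each idea in plain Python. It is not full BLEU (no higher-order n-grams, no brevity penalty) and not a full ROUGE implementation. The function names and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> int:
    """Count words shared by both texts, clipped to the reference counts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    return sum((cand_counts & ref_counts).values())


def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-style idea: fraction of candidate words found in the reference."""
    n_candidate = len(candidate.split())
    if n_candidate == 0:
        return 0.0
    return unigram_overlap(candidate, reference) / n_candidate


def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style idea: fraction of reference words found in the candidate."""
    n_reference = len(reference.split())
    if n_reference == 0:
        return 0.0
    return unigram_overlap(candidate, reference) / n_reference


candidate = "the cat sat"
reference = "the cat sat on the mat"
print(unigram_precision(candidate, reference))  # every candidate word is in the reference
print(unigram_recall(candidate, reference))     # only half of the reference is covered
```

A short candidate can score perfectly on precision while missing half of the reference, which is why the two views are usually reported together or combined with a length penalty.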

In this article, I will show how I used prompt engineering with GPT-4-powered ChatGPT to quickly find relevant research papers on these metrics and to summarize and understand them concisely, including mathematical formulas rendered in LaTeX. I also produced relevant, runnable Python code examples using Hugging Face datasets on Google Colab, and even created a slide presentation, all in under 10 minutes. Used properly, Large Language Models have an almost unlimited range of applications!


  • ChatGPT Extensions
  • Prompt Engineering
  • Retrieving Scholar Papers related to BLEU and ROUGE metrics
  • Understanding the Mathematics of the BLEU and ROUGE metrics



Hair Parra

Data Scientist & Data Engineer. CS, Stats & Linguistics graduate. Polyglot.