[ChatGPT New Update] Introduction of the GPT-4o Model, the Birth of Innovative Multimodal AI

Source: OpenAI – Hello GPT-4o

In this post, we introduce ‘GPT-4o’, which is faster and more cost-effective than the previous GPT-4 Turbo.

Released by OpenAI on May 13, 2024, ‘GPT-4o’ is the latest model powering ChatGPT. Below, we explain what ‘GPT-4o’ is, what makes it innovative, and what changes it is likely to bring.


What is GPT-4o?

‘GPT-4o’ is OpenAI’s newly announced flagship model, capable of processing text, audio, and image data in real time. The ‘o’ stands for ‘omni’, reflecting its ability to handle many kinds of input and output within a single model. ‘GPT-4o’ is designed to make interactions with humans feel much more natural.


Key Features and Performance

Multimodal Input and Output

One of the biggest innovations of ‘GPT-4o’ is its ability to handle many kinds of input and output: it can understand and generate any combination of text, audio, and image data.

  • Input Processing: ‘GPT-4o’ accepts text, audio, and images as input. For example, users can ask questions by voice, upload images, or type text (see the API sketch after this list).
  • Output Generation: The model can produce text responses, audio responses, and generated images, enabling richer and more intuitive interactions.
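
As a concrete illustration, here is a minimal sketch of how a combined text-and-image request could be sent to ‘GPT-4o’ through the OpenAI Chat Completions API using the openai Python SDK. The prompt and image URL are placeholders, and since audio input and output were rolled out to the API more gradually after launch, the sketch covers text and image only.

  from openai import OpenAI

  client = OpenAI()  # reads the OPENAI_API_KEY environment variable

  # One user message mixing a text part and an image part.
  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "What does this chart show?"},
                  {
                      "type": "image_url",
                      # Placeholder URL: replace with a real, publicly reachable image.
                      "image_url": {"url": "https://example.com/sales-chart.png"},
                  },
              ],
          }
      ],
  )

  print(response.choices[0].message.content)  # the model's text answer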

Response Speed

‘GPT-4o’ responds very quickly to audio input: the average response time is 320 milliseconds, similar to the natural response time in human conversation, and the fastest measured response is just 232 milliseconds.

Performance Improvements

‘GPT-4o’ shows significant performance improvements compared to ‘GPT-4 Turbo’.

  • Text Processing: It matches ‘GPT-4 Turbo’ on English text and code while performing noticeably better on non-English languages.
  • Cost Efficiency: API usage costs are 50% lower than for ‘GPT-4 Turbo’, letting more users and developers access high-performance AI without a heavy financial burden (see the cost sketch after this list).
  • Processing Speed: The model is twice as fast in the API, so it can handle more work in less time.
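
To make the cost difference concrete, here is a small sketch comparing the price of a single request under the approximate per-token prices published around launch (about USD 5 per million input tokens and USD 15 per million output tokens for ‘GPT-4o’, versus USD 10 and USD 30 for ‘GPT-4 Turbo’). These figures are assumptions for illustration only; always check OpenAI’s current pricing page.

  # Illustrative cost comparison. Prices are assumed launch-time values in USD
  # per 1M tokens and may have changed; always check OpenAI's pricing page.
  PRICES = {
      "gpt-4o":      {"input": 5.00,  "output": 15.00},
      "gpt-4-turbo": {"input": 10.00, "output": 30.00},
  }

  def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
      """Cost in USD of one request with the given token counts."""
      p = PRICES[model]
      return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

  # Example: a 2,000-token prompt that produces a 500-token answer.
  for model in PRICES:
      print(model, round(request_cost(model, 2_000, 500), 5))
  # GPT-4o: $0.0175, GPT-4 Turbo: $0.035 -- the same request at half the cost.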

Audio and Vision Understanding

‘GPT-4o’ has greatly improved capabilities in audio recognition and vision understanding.

  • Audio Recognition: It shows superior audio recognition performance across all languages compared to ‘Whisper-v3’, especially excelling in low-resource languages.
  • Audio Translation: It also demonstrates state-of-the-art performance in audio translation, surpassing ‘Whisper-v3’ in the MLS benchmark.
  • Vision Understanding: ‘GPT-4o’ excels in benchmarks requiring visual perception, such as object recognition in images and interpreting charts, indicating its excellent performance in various fields.

Unified Model Integration

‘GPT-4o’ processes text, audio, and image data within a single model, which minimizes information loss between input and output and enables more natural, consistent interactions.

  • Integrated Processing: Previously, voice mode required several models working together in a pipeline, whereas ‘GPT-4o’ handles the entire task with a single model. As a result it can pick up on tone of voice, multiple speakers, and background noise, and it can produce expressive audio output such as laughter, singing, and emotional speech (see the pipeline sketch below).
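
For context, here is a rough sketch of what the older, pipeline-style voice mode looks like when built from separate OpenAI API calls: transcription, text-only reasoning, then speech synthesis. Each handoff passes only text, which is where tone, multiple speakers, and background sound get lost; ‘GPT-4o’ replaces this chain with one model. The specific model and voice names below are illustrative choices, not the exact components ChatGPT used.

  from openai import OpenAI

  client = OpenAI()

  # Step 1: speech-to-text (everything except the words is discarded here).
  with open("question.mp3", "rb") as audio_file:
      transcript = client.audio.transcriptions.create(
          model="whisper-1", file=audio_file
      )

  # Step 2: text-only reasoning over the transcript.
  chat = client.chat.completions.create(
      model="gpt-4-turbo",
      messages=[{"role": "user", "content": transcript.text}],
  )
  answer = chat.choices[0].message.content

  # Step 3: text-to-speech on the answer; the intonation is chosen by the TTS
  # model and is not informed by how the original question sounded.
  speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
  # `speech` holds the synthesized audio, ready to be written to a file.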

Safety

‘GPT-4o’ was designed with safety considerations from the ground up.

  • Data Filtering: It enhances the model’s safety by filtering out unsafe content in the training data.
  • Behavior Adjustment: The model’s behavior is finely tuned through post-training.
  • Safety Systems: New safety systems are introduced to ensure the safety of audio outputs, which are limited to predefined voices and adhere to existing safety policies.

Performance Evaluation

‘GPT-4o’ has demonstrated excellent results across a range of benchmarks, showing strong capabilities in text, audio, and image processing. The improvements are especially clear in non-English text processing, speech recognition, and image understanding. Let’s take a closer look at the main evaluation items.

Text Evaluation

  • Zero-shot CoT (Chain of Thought) MMLU (Massive Multitask Language Understanding): In the zero-shot CoT MMLU benchmark, where the model answers exam-style questions with no worked examples in the prompt and is asked to reason step by step, ‘GPT-4o’ set a new high score of 88.7%, demonstrating strong understanding across many subjects.
  • 5-shot no-CoT MMLU: In the 5-shot no-CoT setting, where five solved examples are included in the prompt and the model answers directly, ‘GPT-4o’ scored 87.2%, showing it can answer accurately from only a handful of examples (see the prompt sketch after this list).
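
To clarify the difference between the two settings, here is an illustrative sketch of how such prompts might be constructed. Zero-shot CoT asks the model to reason step by step with no worked examples, while 5-shot no-CoT prepends five solved examples and expects a direct answer. This is a simplified illustration, not the exact evaluation harness OpenAI used.

  def zero_shot_cot_prompt(question: str, choices: list[str]) -> str:
      """Zero-shot CoT: no examples; ask the model to reason before answering."""
      options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
      return (
          f"{question}\n{options}\n"
          "Think step by step, then give the final answer as a single letter."
      )

  def five_shot_prompt(question: str, choices: list[str],
                       examples: list[tuple[str, str]]) -> str:
      """5-shot no-CoT: five solved examples, then the question; answer directly."""
      shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
      options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
      return f"{shots}\n\n{question}\n{options}\nAnswer:"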

Audio Evaluation

  • ASR (Automatic Speech Recognition) Performance: ‘GPT-4o’ outperforms ‘Whisper-v3’ at speech recognition, accurately transcribing speech in many languages, including low-resource ones; this reflects how well its multimodal training handles audio data (a scoring sketch follows this list).
  • Audio Translation Performance: ‘GPT-4o’ achieves state-of-the-art results in speech translation, surpassing ‘Whisper-v3’ on the MLS benchmark and accurately translating audio across languages.
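
Speech recognition quality in comparisons like this is usually reported as word error rate (WER), where lower is better. Here is a minimal sketch of how two models’ transcripts could be scored against a reference using the third-party jiwer package; the transcripts are made-up examples, and this is not the harness OpenAI used.

  from jiwer import wer  # pip install jiwer

  reference = "the quick brown fox jumps over the lazy dog"

  # Hypothetical transcripts from two ASR systems for the same audio clip.
  transcripts = {
      "model_a": "the quick brown fox jumps over the lazy dog",
      "model_b": "the quick brown fox jumped over a lazy dog",
  }

  for name, hypothesis in transcripts.items():
      # wer() returns the fraction of word-level errors against the reference.
      print(name, round(wer(reference, hypothesis), 3))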

Vision Evaluation

  • M3Exam: The M3Exam benchmark combines multilingual and visual evaluation, using multiple-choice questions from standardized exams in multiple countries, many of which include figures and diagrams. ‘GPT-4o’ outperforms ‘GPT-4’ on this benchmark across all languages, showing a significant improvement in visual understanding.
  • Vision Understanding Evaluation: ‘GPT-4o’ achieves top performance on visual benchmarks evaluated zero-shot, including MMMU (a massive multi-discipline multimodal understanding benchmark), MathVista (mathematical reasoning over visual content), and ChartQA (interpreting charts and graphs). It excels at understanding and analyzing complex visual data without task-specific examples.

Language Tokenization

‘GPT-4o’ introduces a new tokenizer that significantly reduces the number of tokens needed in many languages, improving compression efficiency and lowering processing cost. The figures below come from OpenAI’s example sentences for each language; even English and Spanish see a 1.1x reduction, and the gains are far larger for many other scripts (a tokenizer comparison sketch follows the list).

  • Gujarati: Reduced from 145 to 33 tokens (4.4x reduction)
  • Telugu: Reduced from 159 to 45 tokens (3.5x reduction)
  • Tamil: Reduced from 116 to 35 tokens (3.3x reduction)
  • Marathi: Reduced from 96 to 33 tokens (2.9x reduction)
  • Hindi: Reduced from 90 to 31 tokens (2.9x reduction)
  • Urdu: Reduced from 82 to 33 tokens (2.5x reduction)
  • Arabic: Reduced from 53 to 26 tokens (2.0x reduction)
  • Persian: Reduced from 61 to 32 tokens (1.9x reduction)
  • Russian: Reduced from 39 to 23 tokens (1.7x reduction)
  • Korean: Reduced from 45 to 27 tokens (1.7x reduction)
  • Vietnamese: Reduced from 46 to 30 tokens (1.5x reduction)
  • Chinese: Reduced from 34 to 24 tokens (1.4x reduction)
  • Japanese: Reduced from 37 to 26 tokens (1.4x reduction)
  • Turkish: Reduced from 39 to 30 tokens (1.3x reduction)
  • Italian: Reduced from 34 to 28 tokens (1.2x reduction)
  • German: Reduced from 34 to 29 tokens (1.2x reduction)
  • Spanish: Reduced from 29 to 26 tokens (1.1x reduction)
  • Portuguese: Reduced from 30 to 27 tokens (1.1x reduction)
  • French: Reduced from 31 to 28 tokens (1.1x reduction)
  • English: Reduced from 27 to 24 tokens (1.1x reduction)
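
You can reproduce this kind of comparison yourself with the tiktoken library: GPT-4 and GPT-4 Turbo use the cl100k_base encoding, while ‘GPT-4o’ uses the newer o200k_base encoding. The sample sentences below are arbitrary, so the exact counts will differ from the table above, but the relative reduction is the point.

  import tiktoken  # pip install tiktoken

  old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo tokenizer
  new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o tokenizer

  samples = {
      "English": "Hello, my name is GPT-4o. I am a new kind of language model.",
      "Korean": "안녕하세요, 제 이름은 GPT-4o입니다. 저는 새로운 유형의 언어 모델입니다.",
  }

  for language, text in samples.items():
      old_tokens = len(old_enc.encode(text))
      new_tokens = len(new_enc.encode(text))
      print(f"{language}: {old_tokens} -> {new_tokens} tokens "
            f"({old_tokens / new_tokens:.1f}x reduction)")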

Taken together, these evaluations show that ‘GPT-4o’ understands and processes a wide range of languages and situations well, and that its integrated handling of text, audio, and image data enables more natural and efficient interaction.


Conclusion

‘GPT-4o’ can fairly be called a landmark model that opens a new chapter in artificial intelligence. Its ability to process text, audio, and image data together in real time will make human-computer interaction more natural and efficient. Its strong performance across many languages and its greatly improved audio recognition and vision understanding stand out as particular strengths.

Through the development and release of ‘GPT-4o’, OpenAI provides safer and more efficient AI technology, rapidly expanding the applicability of AI across various fields. With significant advancements in cost efficiency, response speed, and multimodal processing capabilities, ‘GPT-4o’ will continue to play a crucial role in the evolution of AI technology.

We encourage you to experience the amazing features and performance of ‘GPT-4o’ firsthand. This will allow you to anticipate how AI will bring changes to our daily lives and various industries.
