
TL;DR: This article explores the rapid advancements in AI, tracing a path from the fundamental algorithms of machine learning to the frontiers of multimodal AI applications. It highlights the evolution of algorithms, the rise of deep reinforcement learning, and the emergence of generative AI and large language models. It also explores the democratization of AI through open-source initiatives and the ground-breaking developments in diffusion models, generative audio, and multimodal AI.

  • The foundations of machine learning
  • The evolution of algorithms and the rise of neural networks
  • Deep reinforcement learning and mastery through practice
  • Generative AI and large language models
  • Open-source AI and the democratization of technology
  • Breakthroughs in diffusion models, generative audio, and multimodal AI

The story of Artificial Intelligence (AI) unfolds like a sequence of cascading breakthroughs, each illuminating the technology’s potential like a flash of lightning on a stormy night. This article chronicles these advancements, tracing a path from the fundamental algorithms of machine learning to the frontiers of multimodal AI applications.

The Foundations of Machine Learning

Our contemporary AI revolution was largely driven by machine learning (ML), a computational process that enables systems to learn from data without explicit programming. This paradigm shift unlocked a world where machines could not only process information but actively evolve their understanding. 

Pioneering researchers like Arthur Samuel in the 1950s and Tom Mitchell in the 1970s laid the groundwork for this field, establishing core principles like supervised, unsupervised, and reinforcement learning. These algorithms, initially simple, formed the foundation for the sophisticated models that power AI today. 

Examples of more recent applications of machine learning include: 

  • Recommendation engines: Pioneered by companies like Amazon and Netflix, these systems analyze user behavior patterns to suggest relevant products or content.
  • Spam filtering: Google and other email providers utilize machine learning to identify and filter out unwanted spam messages. 
  • Fraud detection: Financial institutions leverage machine learning to identify and prevent fraudulent transactions. 
  • Maps and navigation: Google Maps and other navigation apps employ machine learning to optimize routes, predict traffic patterns, and suggest alternative routes in real-time. 

These are just a few examples of how machine learning has transformed various industries, paving the way for a future where AI continues to reshape our world. 

Evolution of Algorithms 

The earliest algorithms were but a faint echo of their present-day counterparts. Decision trees, akin to flowcharts, were among the initial attempts at building intelligent systems. These rudimentary algorithms classified data based on a series of predefined rules, but their inflexibility limited their ability to handle complex tasks. As the field progressed, more sophisticated algorithms emerged. 
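
To make the idea concrete, the sketch below shows a decision tree learning a small set of if/else rules from toy data. It is a minimal illustration assuming the scikit-learn library; the figures and feature names are purely invented for the example.

```python
# A minimal sketch of rule-based classification with a decision tree,
# assuming scikit-learn is installed; the toy data is purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [hours_of_study, hours_of_sleep]; label: 1 = passed exam, 0 = failed
X = [[8, 7], [2, 5], [6, 8], [1, 4], [7, 6], [3, 9]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the learned "flowchart" of if/else rules
print(export_text(tree, feature_names=["study_hours", "sleep_hours"]))

# Classify a new, unseen example
print(tree.predict([[5, 7]]))  # expected to predict a pass, given the learned rules
```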

The true turning point came with the rise of artificial neural networks. Inspired by the structure and functions of the human brain, these networks consist of interconnected nodes, each simulating a neuron. By adjusting the connections between these nodes based on training data, neural networks can learn complex patterns and relationships within the data. The development of the backpropagation algorithm in the 1980s, which allowed for efficient training of these networks, further propelled their advancement. 
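
The sketch below illustrates the idea on the classic XOR problem: a tiny two-layer network, written with nothing but NumPy, adjusts its connection weights by backpropagation until its outputs match the training labels. It is a minimal illustration of the mechanism, not how production networks are built.

```python
# A minimal sketch of a two-layer neural network trained with backpropagation,
# using only NumPy; the XOR toy problem stands in for "complex patterns".
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR is not linearly separable

# Connection weights and biases between the simulated "neurons"
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(20_000):
    # Forward pass: propagate the inputs through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (backpropagation): push the error back and adjust the connections
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2))  # should move towards [[0], [1], [1], [0]] as training proceeds
```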

As algorithms advanced, so did the understanding of the pivotal role data plays in AI. Once seen as cumbersome, vast datasets became the lifeblood of machine learning. The ability to collect, store, and process massive amounts of data became crucial for training and improving AI models. The rise of Big Data technologies and advancements in distributed computing facilitated the handling of these ever-growing datasets. 

However, data alone is not enough. Quality and relevance of data significantly impact the performance of AI models. Techniques like data cleaning and feature engineering became essential in preparing data for effective utilization in machine learning tasks. 
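
As a simple illustration, the snippet below walks through typical cleaning and feature-engineering steps with pandas; the column names and values are invented for the example.

```python
# A minimal sketch of data cleaning and feature engineering with pandas;
# the dataset is purely illustrative.
import pandas as pd

raw = pd.DataFrame({
    "age":        [34, None, 29, 51],
    "country":    ["UK", "US", "UK", None],
    "last_login": ["2024-05-01", "2024-04-28", "2024-03-30", "2024-05-03"],
})

clean = raw.copy()
# Data cleaning: handle missing values and enforce consistent types
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["country"] = clean["country"].fillna("unknown")
clean["last_login"] = pd.to_datetime(clean["last_login"])

# Feature engineering: derive inputs a model can actually learn from
clean["days_since_login"] = (pd.Timestamp("2024-05-10") - clean["last_login"]).dt.days
clean = pd.get_dummies(clean.drop(columns=["last_login"]), columns=["country"])

print(clean)
```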

Deep Reinforcement Learning: Mastery Through Practice

Deep reinforcement learning fused the neural network’s prowess in handling unstructured data with a reward system that incentivized desired outcomes. This potent combination allowed AI to learn through simulated interactions, progressively honing its skills over time. Unlike supervised learning, which requires labeled data, deep reinforcement learning agents learn by interacting with their environment and receiving rewards for achieving specific goals. 
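
The reward-driven loop at the heart of this approach can be shown in its simplest, tabular form; deep reinforcement learning replaces the lookup table below with a neural network. The five-cell "corridor" environment is invented purely for illustration.

```python
# A minimal sketch of the reward-driven loop behind reinforcement learning, in its
# simplest tabular Q-learning form; deep RL swaps the table for a neural network.
import random

N_STATES, GOAL = 5, 4              # the agent starts at cell 0, the reward waits at cell 4
ACTIONS = [-1, +1]                 # move left or move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]

alpha, gamma, epsilon = 0.5, 0.9, 0.3

for episode in range(300):
    state = 0
    while state != GOAL:
        # Explore occasionally, otherwise exploit what has been learned so far
        a = random.randrange(2) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state = max(0, min(N_STATES - 1, state + ACTIONS[a]))
        reward = 1.0 if next_state == GOAL else 0.0
        # Core update: nudge the value of (state, action) towards reward + discounted future value
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

print([q.index(max(q)) for q in Q[:GOAL]])  # learned policy: should be all 1s ("move right")
```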

One of the most notable examples of deep reinforcement learning in action is AlphaGo, developed by DeepMind. This AI program famously defeated professional Go champions in 2016, showcasing the potential of AI to master complex games through self-directed learning. Beyond games, deep reinforcement learning has applications in various fields, including robotics, autonomous vehicles, and industrial process optimization. 

Games: The Training Ground 

Games, in fact, proved to be the perfect, controlled environment for deep reinforcement learning to demonstrate its prowess. Games like Go presented complex strategic challenges requiring high levels of decision-making, planning, and adaptation. These environments provided a safe space for AI agents to explore, experiment, and learn without the risk of real-world consequences. The success of AI in these game environments would serve as a powerful testament to its potential for tackling real-world problems. 

Real-World Applications

The principles honed within the confines of games quickly transitioned to the real world. From optimizing energy consumption in data centers to navigating the intricate world of protein folding, deep reinforcement learning began tackling problems that once resided solely within the domain of human expertise. In healthcare, for example, AI is being used to develop treatment plans for cancer patients and optimize drug discovery processes. 

Although the earlier developments ignited excitement among those who worked on them, it wasn’t until the advent of generative AI that a widespread and profound realization occurred – a lightning flash of awareness within the general population. 

Generative AI and Large Language Models: The New Creators

The emergence of generative AI and large language models is nothing short of extraordinary. These systems possess the remarkable ability to create entirely new content, from captivating prose and evocative poetry to stunning visuals and beyond. 

For example, Anthropic’s Claude 3.5 Sonnet, launched in June 2024, included a new feature called Artifacts, which enables users to generate code, and therefore interactive objects and dashboards, directly within the LLM interface. These models began to significantly impact industries dependent on creative content, from marketing and advertising to software development and design. In marketing, LLMs started to be used to generate personalized ad copy and product descriptions that resonate with target audiences. In software development, generative AI was being used to facilitate code generation and automate repetitive tasks, improving developer productivity. Additionally, AI-powered design tools started assisting designers in creating logos, prototypes, and marketing materials, streamlining the creative process.

The LLM (Large Language Model)

The release of ChatGPT by OpenAI in November 2022 marked a watershed moment in AI history. This massive language model, containing 175 billion parameters (the dials that fine-tune how the model understands and generates language), showcased an AI’s prowess in producing human-quality text. It could answer intricate questions, craft summaries, and even compose short stories with an unprecedented level of fluency and coherence. It was later followed by the launch of the GPT Store, enabling people to build and share their own bespoke GPTs.

Following the trail blazed by GPT, a wave of ever-more advanced generative models swept through the landscape.

And Not Forgetting the Hardware


The emergence of generative AI, with its ability to produce human-quality text, images, and code, has sparked a significant demand for more powerful hardware, particularly specialized chips. This has been a boon for established players like Nvidia, whose share price has seen substantial and sustained growth throughout generative AI’s development. However, it has also opened doors for new companies like Groq (not to be confused with Grok, the LLM launched by Elon Musk’s xAI).


Groq pioneered a novel chip architecture, the Language Processing Unit (LPU), built specifically for inference, the stage at which a trained large language model (LLM) generates its answers.
This enables significantly faster response times and reduced costs, paving the way for more intuitive and natural interactions with AI. It also allows for the development of cost-effective hardware devices that integrate AI, making it possible to interact with AI through voice queries and receive responses for just a few cents, bringing the technology within reach for everyday use.


These advancements, along with continued research and development, are expected to further drive the growth of cost-efficient AI hardware and foster even more intuitive and ubiquitous interactions with AI in the years to come.

Native Voice LLMs

Native Voice Large Language Models (LLMs) are at the forefront of AI advancements, enabling more natural and intuitive interactions between humans and machines. These models can generate and understand speech that closely mimics human conversation, significantly enhancing user experiences in various applications such as virtual assistants and customer service bots.

OpenAI showcased its latest model, GPT-4o, at its Spring Update event in May 2024. The native voice model is still in early beta and has not yet been fully launched, but it is being demonstrated at various conferences. Unlike previous models that used separate pipelines for speech-to-text and text-to-speech, GPT-4o is trained end-to-end to handle audio, vision, and text inputs. This integration reduces latency and enhances the overall interaction quality, making GPT-4o capable of real-time voice recognition and generation. This model aims to provide instantaneous and natural interactions, aligning closely with human conversational speed and fluidity.

Source: Introducing GPT-4o, OpenAI

In contrast, the French AI company Kyutai has released a new model called Moshi, which is currently available for testing in an open beta environment. Launched in July 2024, Moshi features an exceptionally low latency of just 200 milliseconds, making it ideal for real-time communication applications. This low latency ensures that responses are nearly instantaneous, creating a seamless and engaging user experience. Moshi’s development emphasizes the potential for small, innovative teams to make significant advancements in AI technology, pushing the boundaries of what’s possible with voice-enabled applications.

These advancements from OpenAI and Kyutai highlight the transformative potential of native voice LLMs. OpenAI’s GPT-4o, though still in beta, promises to make human-machine interactions more intuitive and efficient, while Kyutai’s Moshi, available for immediate testing, showcases the practical applications and immediate benefits of low-latency voice AI technology.

Open-Source AI: The Democratization of Technology

The movement to open-source AI models accelerated the democratization of this powerful technology. By making state-of-the-art models accessible to a wider community of developers and researchers, these initiatives started to foster a vibrant ecosystem of collaboration and innovation. 

The Diffusion Model

Diffusion models have revolutionized the generative AI landscape, particularly in image synthesis, by turning chaotic digital noise into coherent, detailed imagery. This transformation process is akin to an artist beginning with a blank canvas, progressively refining and shaping it into a masterpiece. 
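
Conceptually, the generation loop looks something like the sketch below: begin with random noise and repeatedly subtract the noise a model predicts is still present. The `toy_denoiser` here is a stand-in for the trained neural network a real diffusion model would use, so treat it as an illustration of the process rather than an implementation.

```python
# A conceptual sketch of the reverse diffusion process: start from pure noise and
# repeatedly denoise. The toy_denoiser below stands in for the trained network
# that a real diffusion model would use.
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0, 1, 16).reshape(4, 4)   # stands in for "the image the prompt describes"
x = rng.normal(size=(4, 4))                    # start from pure digital noise

def toy_denoiser(noisy_sample):
    """Stand-in for the trained network that predicts the noise still present."""
    return noisy_sample - target               # a real model learns this estimate from data

T = 50
for t in range(T):
    predicted_noise = toy_denoiser(x)
    x = x - 0.1 * predicted_noise                          # remove a little noise each step
    x = x + 0.01 * (1 - t / T) * rng.normal(size=x.shape)  # shrinking stochastic term

print(round(float(np.abs(x - target).mean()), 4))  # small: the sample has converged on the target
```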

While OpenAI’s DALL-E 2 was notable in demonstrating these capabilities, Midjourney emerged as the standout platform pushing the boundaries of AI-driven creativity.

Midjourney’s Version 5, released in March 2023, marked a significant milestone in achieving a new level of realism in AI-generated images, with Version 5.1 following on May 3, 2023. This generation of the model was pivotal in showcasing the profound capabilities of diffusion models, illustrated by the viral image of the Pope wearing a white puffer jacket, a visually striking piece that blurs the lines between AI-generated content and genuine photography. This image not only captivated the public’s imagination but also served as a powerful testament to the sophistication and potential of diffusion models in crafting images that resonate on a human level.

Midjourney V6: Mona Lisa in the style of George Orwell’s 1984
Source: Artificial World, X (Formerly Twitter)

Subsequent developments in Midjourney, such as enhanced panning and in-filling capabilities, have further elevated the platform’s ability to generate intricate and contextually rich imagery, showcasing the ongoing evolution and refinement of AI’s creative potential. 

For those intrigued by the intersection of technology and creativity, Midjourney’s achievements represent a fascinating area of exploration. The platform’s ongoing advancements reflect a broader trend in AI’s capacity to enhance and redefine the creative process, offering a glimpse into a future where AI’s role in art and design is both transformative and integral.  

Flux for Text

Flux, a newly released diffusion model from Germany’s Black Forest Labs, demonstrates high prompt adherence, enabling text within images to be rendered accurately.

In addition to rendering text with unprecedented accuracy, Flux generates high-quality images in under two seconds. This rapid processing points to where we are heading: the generation of content within the loading time of an ad unit.

Generative Audio 

Since 2022, generative AI in audio has taken remarkable strides, particularly in the realm of creating songs. This trend has put pressure on music labels to innovate as the industry edges closer to a new era of productivity and creativity.

Among these developments, we’ve seen various platforms and tools emerge, pushing the boundaries of how music is generated and customized. Adobe’s Project Music GenAI Control serves as an example, illustrating the potential of AI to convert text descriptions into adaptable tunes. While this innovation from Adobe is significant, it’s part of a broader movement that includes other platforms leveraging AI to create music.

These advancements signal a shift towards more automated and personalized content creation, challenging traditional production methods and offering new opportunities for artists and producers. This period of innovation underscores the growing influence of AI in transforming the creative landscape, marking a step forward in how music is conceived, produced, and experienced. 

ElevenLabs, a pioneer in the generative audio space, offers state-of-the-art AI-powered tools that transform how users create and interact with audio content. Their products include voice synthesis, audio cloning, and real-time translation, allowing users to generate natural, human-like speech in various languages and accents. 

In May 2024 the company launched text-to-sound effects, allowing users to generate sound effects, short instrumental tracks, soundscapes, and a wide variety of character voices all from a text prompt. 

Multimodal AI

Multimodal AI, a significant leap forward in the field, integrates various data inputs to create a more comprehensive understanding of the world for AI systems. This convergence of modalities empowers machines to process and interpret information in a way that mimics human perception. Multimodal AI allows for the input of diverse data types, including text, audio, video, and sensor data, leading to a richer and more nuanced understanding of the world.

Tools like Google’s Gemini and OpenAI’s GPT-4o analyze uploaded images and videos and “decipher” their content. This allows people to point their phones at a picture or video and gain insights such as the components identified within it, potentially revolutionizing image search, accessibility tools, and even creative image editing.
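
In practice, sending an image alongside a text question is a single API call. The sketch below assumes the OpenAI Python SDK and an API key in the environment; the image URL is a placeholder.

```python
# A minimal sketch of asking a multimodal model to "decipher" an image, assuming the
# OpenAI Python SDK and an API key in the environment; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What components can you identify in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/control-panel.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```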

One of the early examples was a conversation between Ethan Mollick of Wharton Business School and ChatGPT/Bing, where Ethan asked questions about a nuclear reactor control table with hilarious responses from the AI (see example of the chat).

It was at this point, when multimodal interactions with AI became available, that we started to witness the full realization of AI’s emergence: the sense that LLMs and diffusion models had developed their own ‘proto-sentience.’

The GPT Vision model’s vulnerability to optical illusions has left many AI researchers scratching their heads as to why a computer vision system that analyzes data at the pixel level would exhibit features, or bugs, specific to the idiosyncratic nature of the human brain.

Text to Worlds: The Genie of Creation

OpenAI’s unveiling of its advanced generative model Sora, capable of transforming text into video, set a new benchmark in the field of generative AI. In a similar vein, Google DeepMind introduced Genie, a ground-breaking model that turns short descriptions, sketches, or photos into playable video games reminiscent of classic 2D platformers like Super Mario Bros.

Unlike the swift gameplay of contemporary games, Genie’s creations progress at a more measured pace of one frame per second. While in its nascent stage, this technology demonstrates the potential to significantly lower the barriers to video game creation, making it accessible to a wider audience without the need for sophisticated programming skills.

Genie differentiates itself by being trained solely on video footage of various 2D platform games, a departure from previous approaches that required pairing video frames with corresponding input actions, such as button presses. This method allowed for the utilization of vast amounts of video data available online, simplifying the training process and expanding the model’s learning potential.

Genie’s capacity to generate games from simple sketches or images on the fly, adapting the gameplay based on player actions, showcases the advancements in AI’s creative capabilities. Despite its current limitations in frame rate, future iterations of Genie promise improvements in speed and complexity, potentially offering new tools for creativity and game development, as well as applications in robotics and other fields.

This ground-breaking technology holds immense potential for gaming, simulation, and interactive training industries. Users can describe the desired environment, characters, and storyline, and Genie will bring their vision to life, creating an immersive and dynamic world for exploration and interaction. Genie opens doors for innovative and immersive gaming experiences that were previously unimaginable.

As remarkable as these developments are, we have yet to see their full impact on the world around us: new businesses, services, applications, and devices. But it is coming. After the flash of lightning, the boom of thunder is now upon us.

Complex Reasoning and Problem Solving

Released on September 12, 2024, OpenAI’s o1 model represents a significant advancement in artificial intelligence, particularly in its ability to handle complex reasoning tasks. Previously codenamed “Strawberry,” the model is designed to “think before it answers,” employing a chain-of-thought approach to tackle intricate problems in fields such as science, coding, and mathematics. 

The o1 model comes in two variants: o1-preview, which is the main model with broad capabilities, and o1-mini, a smaller, more efficient version optimized for coding tasks. Both models have demonstrated impressive performance in various benchmarks. For instance, o1 scored 83% on International Mathematical Olympiad qualifying exams, a significant improvement over its predecessor GPT-4o’s 13%. It also reached the 89th percentile in Codeforces programming challenges and outperformed PhD-level experts in physics, biology, and chemistry problems. 

The model’s enhanced reasoning capabilities are attributed to its training using reinforcement learning, which allows it to refine its thinking process, try different strategies, and recognize mistakes. Despite these advancements, o1 has some limitations, including slower processing times and the current lack of features like web browsing and image processing. OpenAI plans to regularly update and improve the o1 series, potentially expanding its capabilities and addressing current limitations in future iterations. 

Text to Action: The Next Frontier in AI

The next frontier will be text-to-action. This innovation will combine existing capabilities, such as text-to-code, with function calling, enabling AI to perform a series of linked tasks (searching the web, sending SMS messages, or making phone calls) in response to a single user request.

This capability will be realized through the development of agentic frameworks: a series of linked prompts connected to a shared context window, with the ability to call functions. Consequently, a single command will be able to trigger a chain of actions, all guided by the AI’s understanding of the user’s intent and context. Recent advancements, such as OpenAI’s function calling and Google’s Project Astra, have already laid the groundwork for this future.
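
A highly simplified sketch of such a loop is shown below: the model (stubbed out here) proposes a sequence of tool calls, each result is fed back into the shared context, and the chain continues until the request is satisfied. The tool names and the plan format are hypothetical, chosen only to illustrate the pattern.

```python
# A simplified sketch of an agentic "text-to-action" loop. The tool functions and the
# plan format are hypothetical; a real framework would let the LLM choose tools via
# function calling (returning a structured tool name plus arguments) at each step.
import json

def search_web(query: str) -> str:            # hypothetical tool
    return f"Top result for '{query}' (stubbed)"

def send_sms(number: str, text: str) -> str:  # hypothetical tool
    return f"SMS to {number}: '{text}' (stubbed)"

TOOLS = {"search_web": search_web, "send_sms": send_sms}

def fake_llm(request: str, context: list) -> list:
    """Stand-in for the model: returns a plan of tool calls inferred from the request."""
    return [
        {"tool": "search_web", "args": {"query": "best pizza near me"}},
        {"tool": "send_sms", "args": {"number": "+440000000000", "text": "Pizza at 7?"}},
    ]

def run_agent(user_request: str) -> list:
    context = [user_request]                  # the shared context window
    results = []
    for call in fake_llm(user_request, context):
        output = TOOLS[call["tool"]](**call["args"])   # dispatch the function call
        context.append(output)                # feed the result back for the next step
        results.append(output)
    return results

print(json.dumps(run_agent("Find a good pizza place and text Sam to meet at 7"), indent=2))
```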

This development is set to transform the user experience of generative AI as it is integrated into everyday assistants such as Apple Intelligence, Google Assistant, or Amazon’s Alexa. Soon, with just a simple voice or text command, users will have the power to initiate and complete complex actions effortlessly, bringing us closer to a future where our digital assistants are more capable and intuitive than ever before. These advancements are expected to significantly enhance productivity and user experience, marking a major leap forward in AI’s integration into daily life.

This presents significant implications for marketing. As AI automatically decides on the brand of choice and facilitates conversions, it will upend marketing. For some categories—particularly the low-interest or undifferentiated categories—the focus will shift towards influencing the Large Language Models (LLMs) rather than directly targeting the minds of the audience.

The pace at which AI technology has evolved has been staggering, but what’s even more intriguing is how these advancements are starting to shape our interactions and experiences. Whether it’s the consumer-facing AI enhancements in products we use every day or the seamless integration of AI in business operations, the impact will be profound and far-reaching for people living, working, and learning with GenAI.

From the foundations of machine learning and artificial neural networks to the text-to-worlds and agentic capabilities of contemporary generative AI, a remarkable series of lightning flashes have propelled this profound technology forward. The implications for how we live and work are now playing out all around us. The impact on marketing is equally profound – from generative strategy to the simulation of market dynamics, few areas of marketing will remain untouched by AI. In the next chapter, we explore this impact, the thunder of change rolling through our industry.

The Generative AI Wave

Predicted Penetration of GenAI Development: 2024-2028

In its report The Coming Wave: Gen-AI, Omnicom Media Group projects the monthly penetration of core Generative AI (GenAI) services and experiences in a typical developed market from 2024 to 2028.

Based on expert predictions from a range of sources, the analysis starts with current penetration rates and projects forward, identifying the upper bound of penetration and growth potential. It also establishes the speed of penetration change and the timing of inflection points where the rate of growth accelerates due to market forces. The data is constantly reviewed and updated to reflect the latest trends and developments.