
TL;DR: This article explores the rapid advancements in AI, tracing a path from the fundamental algorithms of machine learning to the frontiers of multimodal AI applications. It highlights the evolution of algorithms, the rise of deep reinforcement learning, and the emergence of generative AI and large language models. It also explores the democratization of AI through open-source initiatives and the ground-breaking developments in diffusion models, generative audio, and multimodal AI.

  • The foundations of machine learning
  • The evolution of algorithms and the rise of neural networks
  • Deep reinforcement learning and mastery through practice
  • Generative AI and large language models
  • Open-source AI and the democratization of technology
  • Breakthroughs in diffusion models, generative audio, and multimodal AI

The story of Artificial Intelligence (AI) unfolds like a sequence of cascading breakthroughs, each igniting the possibilities of the technology’s potential like flashes of lightning illuminating a stormy night. This article chronicles these advancements, tracing a path from the fundamental algorithms of machine learning to the frontiers of multimodal AI applications. 

The Foundations of Machine Learning

Our contemporary AI revolution was largely driven by machine learning (ML), a computational process that enables systems to learn from data without explicit programming. This paradigm shift unlocked a world where machines could not only process information but actively evolve their understanding. 

Pioneering researchers like Arthur Samuel in the 1950s and Tom Mitchell in the 1970s laid the groundwork for this field, establishing core principles like supervised, unsupervised, and reinforcement learning. These algorithms, initially simple, formed the foundation for the sophisticated models that power AI today. 

Examples of more recent applications of machine learning include: 

  • Recommendation engines: Pioneered by companies like Amazon and Netflix, these systems analyze user behavior patterns to suggest relevant products or content.
  • Spam filtering: Google and other email providers utilize machine learning to identify and filter out unwanted spam messages. 
  • Fraud detection: Financial institutions leverage machine learning to identify and prevent fraudulent transactions. 
  • Maps and navigation: Google Maps and other navigation apps employ machine learning to optimize routes, predict traffic patterns, and suggest alternative routes in real-time. 

These are just a few examples of how machine learning has transformed various industries, paving the way for a future where AI continues to reshape our world. 

Evolution of Algorithms 

The earliest algorithms were but a faint echo of their present-day counterparts. Decision trees, akin to flowcharts, were among the initial attempts at building intelligent systems. These rudimentary algorithms classified data based on a series of predefined rules, but their inflexibility limited their ability to handle complex tasks. As the field progressed, more sophisticated algorithms emerged. 

The true turning point came with the rise of artificial neural networks. Inspired by the structure and function of the human brain, these networks consist of interconnected nodes, each simulating a neuron. By adjusting the strength of the connections between these nodes based on training data, neural networks can learn complex patterns and relationships within the data. The popularization of the backpropagation algorithm in the 1980s, which allowed for efficient training of multi-layer networks, further propelled their advancement.
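
To make the idea of "adjusting the connections" concrete, here is a minimal sketch that is not drawn from the article's sources: a single artificial neuron trained by gradient descent to reproduce the logical OR function. With only one layer, this is the simplest case of the gradient-based update that backpropagation extends to deeper networks. The function names, learning rate, and epoch count are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=2000, lr=0.5):
    """Fit one neuron (two weights and a bias) by per-sample gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = sigmoid(w[0] * x1 + w[1] * x2 + b)
            # Gradient of the cross-entropy loss with respect to the
            # neuron's pre-activation value.
            error = out - target
            # Nudge each connection in the direction that reduces the error.
            w[0] -= lr * error * x1
            w[1] -= lr * error * x2
            b -= lr * error
    return w, b

def predict(w, b, x1, x2):
    return 1 if sigmoid(w[0] * x1 + w[1] * x2 + b) >= 0.5 else 0

# Training data: the truth table of logical OR (illustrative choice).
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_neuron(OR_DATA)
predictions = [predict(weights, bias, x1, x2) for (x1, x2), _ in OR_DATA]
```

Nothing in the code states the OR rule explicitly; the neuron arrives at it purely by repeated small corrections to its weights.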

As algorithms advanced, so did the understanding of the pivotal role data plays in AI. Once seen as cumbersome, vast datasets became the lifeblood of machine learning. The ability to collect, store, and process massive amounts of data became crucial for training and improving AI models. The rise of Big Data technologies and advancements in distributed computing facilitated the handling of these ever-growing datasets. 

However, data alone is not enough. Quality and relevance of data significantly impact the performance of AI models. Techniques like data cleaning and feature engineering became essential in preparing data for effective utilization in machine learning tasks. 

Deep Reinforcement Learning: Mastery Through Practice

Deep reinforcement learning fused the neural network’s prowess in handling unstructured data with a reward system that incentivized desired outcomes. This potent combination allowed AI to learn through simulated interactions, progressively honing its skills over time. Unlike supervised learning, which requires labeled data, deep reinforcement learning agents learn by interacting with their environment and receiving rewards for achieving specific goals. 
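
The reward-driven loop described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not AlphaGo's method: it uses tabular Q-learning, where a small table stands in for the neural network that a deep reinforcement learning agent would use, and the five-cell corridor environment, reward values, and hyperparameters are all invented for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)    # 0 = step left, 1 = step right

def step(state, action):
    """Move along the corridor; reward 1.0 only on reaching the goal."""
    if action == 0:
        nxt = max(0, state - 1)
    else:
        nxt = min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # value estimate per (state, action)
    for _ in range(episodes):
        state = 0
        for _ in range(100):                    # cap the episode length
            # Explore at random sometimes, and whenever the estimates are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[nxt])
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
            if done:
                break
    return q

q_table = train()
# Greedy policy for the non-goal cells: 1 means "step right".
policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

No labeled examples are supplied at any point; the agent discovers that stepping right is best solely from the reward it receives at the end of the corridor.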

One of the most notable examples of deep reinforcement learning in action is AlphaGo, developed by DeepMind. This AI program famously defeated professional Go champions in 2016, showcasing the potential of AI to master complex games through self-directed learning. Beyond games, deep reinforcement learning has applications in various fields, including robotics, autonomous vehicles, and industrial process optimization. 

Games: The Training Ground 

Games, in fact, proved to be the perfect, controlled environment for deep reinforcement learning to demonstrate its prowess. Games like Go presented complex strategic challenges requiring high levels of decision-making, planning, and adaptation. These environments provided a safe space for AI agents to explore, experiment, and learn without the risk of real-world consequences. The success of AI in these game environments would serve as a powerful testament to its potential for tackling real-world problems. 

Real-World Applications

The principles honed within the confines of games quickly transitioned to the real world. From optimizing energy consumption in data centers to navigating the intricate world of protein folding, deep reinforcement learning and related deep learning techniques began tackling problems that once resided solely within the domain of human expertise. In healthcare, for example, AI is being used to develop treatment plans for cancer patients and optimize drug discovery processes.

Although the earlier developments ignited excitement among those who worked on them, it wasn’t until the advent of generative AI that a widespread and profound realization occurred – a lightning flash of awareness within the general population. 

Generative AI and Large Language Models: The New Creators

The emergence of generative AI and large language models is nothing short of extraordinary. These systems possess the remarkable ability to create entirely new content, from captivating prose and evocative poetry to stunning visuals and beyond. 

The LLM (Large Language Model)

The release of ChatGPT by OpenAI in November 2022 marked a watershed moment in AI history. Built on the GPT-3.5 family of models, successors to the 175-billion-parameter GPT-3 (parameters being the dials that fine-tune how the model understands and generates language), it showcased an AI’s prowess in producing human-quality text. It could answer intricate questions, craft summaries, and even compose short stories with an unprecedented level of fluency and coherence. It was later followed by the launch of the GPT Store in January 2024, enabling people to build and share their own bespoke GPTs.

Following the trail blazed by GPT, a wave of ever-more advanced generative models swept through the landscape.

And Not Forgetting the Hardware


The emergence of generative AI, with its ability to produce human-quality text, images, and code, has sparked a significant demand for more powerful hardware, particularly specialized chips. This has been a boon for established players like Nvidia, whose share price has seen substantial and sustained growth throughout generative AI’s development. However, it has also opened doors for new companies like Groq (not to be confused with the LLM launched by Elon Musk).


Groq builds chips designed specifically for “inference,” the stage at which an already-trained large language model (LLM) generates its responses, rather than for the far more compute-intensive task of training. This enables significantly faster response times and reduced costs, paving the way for more intuitive and natural interactions with AI. Additionally, it allows for the development of cost-effective hardware devices that integrate AI, making it possible to put voice queries to AI and receive responses for just a few cents, and so making the technology more accessible for everyday use.


These advancements, along with continued research and development, are expected to further drive the growth of cost-efficient AI hardware and foster even more intuitive and ubiquitous interactions with AI in the years to come.

Among the generative models that followed, Anthropic’s Claude 3.5 Sonnet, launched in June 2024, included a new feature called Artifacts. This feature enables the user to generate code and, from it, interactive objects and dashboards, all within the LLM interface. These models began to significantly impact various industries dependent on creative content, from marketing and advertising to software development and design. In the marketing world, LLMs started to be used to generate personalized ad copy and product descriptions that resonate with target audiences. In the field of software development, generative AI was being used to facilitate code generation and automate repetitive tasks, improving developer productivity. Additionally, AI-powered design tools started assisting designers in creating logos, prototypes, and marketing materials, streamlining the creative process.

Native Voice LLMs

Native Voice Large Language Models (LLMs) are at the forefront of AI advancements, enabling more natural and intuitive interactions between humans and machines. These models can generate and understand speech that closely mimics human conversation, significantly enhancing user experiences in various applications such as virtual assistants and customer service bots.

OpenAI showcased its latest model, GPT-4o, at its Spring Update event in May 2024. The native voice capability is still in early beta and has not yet been fully launched, but it is being demonstrated at various conferences. Unlike previous models that used separate pipelines for speech-to-text and text-to-speech, GPT-4o is trained end-to-end to handle audio, vision, and text inputs. This integration reduces latency and enhances the overall interaction quality, making GPT-4o capable of real-time voice recognition and generation. This model aims to provide instantaneous and natural interactions, aligning closely with human conversational speed and fluidity.

Source: Introducing GPT-4o, OpenAI

In contrast, the French AI company Kyutai has released a new model called Moshi, which is currently available for testing in an open beta environment. Launched in July 2024, Moshi features an exceptionally low latency of just 200 milliseconds, making it ideal for real-time communication applications. This low latency ensures that responses are nearly instantaneous, creating a seamless and engaging user experience. Moshi’s development emphasizes the potential for small, innovative teams to make significant advancements in AI technology, pushing the boundaries of what’s possible with voice-enabled applications.

These advancements from OpenAI and Kyutai highlight the transformative potential of native voice LLMs. OpenAI’s GPT-4o, though still in beta, promises to make human-machine interactions more intuitive and efficient, while Kyutai’s Moshi, available for immediate testing, showcases the practical applications and immediate benefits of low-latency voice AI technology.

Open-Source AI: The Democratization of Technology

The movement to open-source AI models accelerated the democratization of this powerful technology. By making state-of-the-art models accessible to a wider community of developers and researchers, these initiatives started to foster a vibrant ecosystem of collaboration and innovation. 

Previously, access to cutting-edge AI models was often restricted to large corporations and research institutions due to the significant computational resources required for training and running them. Meta played a pivotal role in this open-source movement with LLaMA 2, and their recently launched LLaMA 3.1, enabling developers to create their own LLMs.

Major Large Language Model (LLM) Developments:

The Key Players

 

This content is created generatively by AI, and reviewed by a human. Check back regularly for the latest updates and information. 

 

OpenAI’s ChatGPT-4 and 4o:

  • Launched in March 2023, succeeding GPT-3.5. 
  • Multimodal LLM, accepting image and text inputs. 
  • Introduced ‘function calling’ for seamless API integration. 
  • Powers plugins in the ChatGPT app store for extended functionality. 
  • GPT-4o (the “o” stands for “omni”) launched in May 2024, offering a universal and adaptable AI solution with enhanced capabilities across various domains, including language translation, complex problem-solving, and context-aware assistance.
  • Partnership with Apple announced in June 2024, aiming to integrate ChatGPT capabilities into Apple’s ecosystem, enhancing Siri’s functionality, and embedding advanced AI features across Apple’s suite of products and services.
  • OpenAI revealed SearchGPT in July 2024, an advanced search AI designed to provide up-to-date and comprehensive search results, perhaps opening a “war” with Google’s search engine. Further details are yet to be announced.
  • OpenAI released GPT-4o mini, a compact version of GPT-4o that maintains high performance and versatility while being faster and more cost-effective.
  • ChatGPT’s voice mode, allowing users to interact with the chatbot vocally, has moved beyond the beta phase and is now accessible to ChatGPT Plus and Enterprise users. Media planners should consider the potential of voice-activated content and advertising as voice interaction with AI becomes more prevalent. The rise of voice-based AI could lead to new opportunities for audio advertising and branded podcasts.
  • OpenAI has developed a new tool designed to detect AI-generated text, although it has not yet been released publicly. Marketers need to stay informed about tools that can detect AI-generated content to ensure transparency and authenticity.
  • OpenAI faces a class-action lawsuit alleging the unauthorized use of YouTube transcripts for training data. The lawsuit highlights growing concerns about data privacy and copyright in AI development, which could impact how brands use and market AI technologies. 

Anthropic’s Claude:

  • Claude v1 launched in 2023, Claude 2 in July 2023, and Claude 3 in March 2024.
  • Claude 3 offers three models: Haiku, Sonnet, and Opus, with Opus outperforming GPT-4 and Gemini 1.0 Ultra in certain benchmarks. 
  • Focuses on safety, alignment with human values, and following complex instructions. Anthropic is offering rewards of up to $15,000 for successfully executing “universal jailbreak attacks” on its Claude AI model. The focus on security highlights the importance of choosing AI partners with robust safety protocols to protect brand reputation and user trust. 
  • Offers user-friendly APIs for easy integration and near-instant response times. 
  • On June 20, 2024, Anthropic released Claude 3.5 Sonnet, which demonstrated significantly improved performance in areas such as coding, multistep workflows, chart interpretation, and text extraction from images.

Meta’s LLaMA:

  • Original LLaMA models released in early 2023, with sizes ranging from 7B to 65B parameters.
  • LLaMA 2 released in July 2023, building upon the original LLaMA’s success. 
  • Open-sourced models, allowing for community development and custom applications.
  • Known for versatility and adaptability across various tasks.
  • LLaMA 3 released April 2024 in 8B and 70B parameter sizes, emphasizing more efficient training methods and better fine-tuning capabilities for specific tasks.
  • LLaMA 3.1 family of models released in July 2024. The 405B parameter model is the first open-source model reported to match or beat GPT-4, Claude 3.5 Sonnet, and other leading models on a number of benchmarks, offering enhanced mathematical problem-solving and coding capabilities as well as multilingual support.

Google’s Gemini:

  • Launched as Bard in March 2023, rebranded as Gemini in February 2024.
  • Gemini Advanced (powered by the Ultra LLM) released as part of the Google One AI Premium subscription.
  • Natively integrated within the web, enabling seamless browsing experiences and integration with Google apps.
  • Trained on extensive web data, offering broad knowledge coverage and interactive features.
  • Gemini 1.5, first announced in February 2024, received updates at Google I/O in May 2024, where Google also previewed Gemini Live, a voice chat mode, and Gems, the ability to create custom chatbots.
  • Google’s Gemini AI is now accessible in personal Gmail accounts on Android devices. The integration of Gemini into Gmail suggests a growing trend of AI assistants becoming embedded in everyday applications.
  • Gemini 1.5 Flash sees a price reduction and new features enhance its ability to handle PDFs. 

 Other Notable LLMs:

  • Huawei’s Pangu Model in China, known for its massive parameter count and applications in industrial and scientific research, released an upgraded version (5.0) in June 2024 with enhanced natural language processing and real-time data integration capabilities.
  • Falcon, from the UAE’s Technology Innovation Institute (TII), focuses on high-accuracy language translation and AI-driven analytics. Its Falcon 2.0 model launched with expanded capabilities in multilingual support and enterprise solutions.
  • Baidu’s ERNIE, in China, updated in April 2024, continues to push the envelope with innovations in context-aware text generation and sophisticated AI algorithms for large-scale applications in both consumer and industrial markets.
  • Mistral Large 2 (123B parameters, 128K-token context window) was released in July 2024, one day after LLaMA 3.1. It is natively multilingual and released with open weights, with much more advanced reasoning capabilities than its predecessor. Mistral reports scores on leading benchmarks such as MMLU that are competitive with GPT-4o and LLaMA 3.1 405B, with less than a third of the parameters of the latter.

These LLMs – and the competition between them – have revolutionized AI since early 2023, with each model showcasing unique strengths and rapid development. Their advanced language understanding, multimodal capabilities, and user-friendly interfaces have opened new possibilities for developers and users alike. 

The Diffusion Model

Diffusion models have revolutionized the generative AI landscape, particularly in image synthesis, by turning chaotic digital noise into coherent, detailed imagery. This transformation process is akin to an artist beginning with a blank canvas, progressively refining and shaping it into a masterpiece. 
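
The noise-to-image idea can be illustrated with a small numerical sketch. It assumes the widely used DDPM-style formulation with a linear noise schedule; the article does not say which formulation any particular product uses. The four-value "image", the schedule constants, and the function names are invented for the example. A real model trains a neural network to predict the added noise, whereas here the true noise is passed back in as a stand-in so the arithmetic can be checked.

```python
import math
import random

def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, noise, alpha_bar):
    """Forward process: blend the clean signal with Gaussian noise."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * n for x, n in zip(x0, noise)]

def remove_noise(xt, predicted_noise, alpha_bar):
    """Invert the blend, given an estimate of the noise that was added."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - s * n) / a for x, n in zip(xt, predicted_noise)]

rng = random.Random(0)
pixels = [0.1, 0.5, 0.9, 0.3]                  # toy "image" of four pixel values
alpha_bars = alpha_bar_schedule(steps=1000)
noise = [rng.gauss(0.0, 1.0) for _ in pixels]

noisy = add_noise(pixels, noise, alpha_bars[500])
# The true noise stands in for a trained network's prediction (an assumption).
recovered = remove_noise(noisy, noise, alpha_bars[500])
```

By the final step of the schedule almost none of the original signal remains, which is why generation can start from pure noise and work backwards.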

While DALL-E 2 by OpenAI had been a notable application in demonstrating these capabilities, Midjourney emerged as the standout platform in pushing the boundaries of AI-driven creativity. 

Midjourney’s Version 5, released in March 2023, and Version 5.1, which followed on May 3, 2023, marked a significant milestone in achieving a new level of realism in AI-generated images. These versions were pivotal in showcasing the profound capabilities of diffusion models, illustrated by the viral image of the Pope wearing a white puffer jacket, which circulated in March 2023 and blurred the lines between AI-generated content and genuine photography. This image not only captivated the public’s imagination but also served as a powerful testament to the sophistication and potential of diffusion models in crafting images that resonate on a human level.

Midjourney V6: Mona Lisa in the style of George Orwell’s 1984
Source: Artificial World, X (Formerly Twitter)

Subsequent developments in Midjourney, such as enhanced panning and in-filling capabilities, have further elevated the platform’s ability to generate intricate and contextually rich imagery, showcasing the ongoing evolution and refinement of AI’s creative potential. 

For those intrigued by the intersection of technology and creativity, Midjourney’s achievements represent a fascinating area of exploration. The platform’s ongoing advancements reflect a broader trend in AI’s capacity to enhance and redefine the creative process, offering a glimpse into a future where AI’s role in art and design is both transformative and integral.  

Flux for Text

The newly released diffusion model from Germany’s ‘Black Forest Labs’ called ‘Flux’ demonstrates high prompt adherence enabling text to be rendered accurately.

In addition to rendering text with unprecedented accuracy, Flux generates high-quality images in under 2 seconds. This rapid processing points to where we are heading: the generation of content within the loading time of an ad unit.

Major Diffusion Models: The Key Players

  • Stable Diffusion (Stability AI)
    Stable Diffusion is an open-source AI model that creates images from text descriptions. Its latest version, SDXL, offers high-quality image generation. The SDXL Turbo feature allows for near-instant image creation, significantly reducing generation time. (Stability AI)
  • DALL-E 3 (OpenAI)
    DALL-E 3 generates detailed images from text prompts. It features outpainting (extending images beyond their original borders), inpainting (editing specific parts of an image), and variations (creating different versions of an input image). These capabilities allow for versatile image creation and editing. (DALLE-3)
  • Imagen (Google)
    Imagen is Google’s text-to-image AI model known for producing highly realistic images. While specific features are not widely known due to limited public access, Google has announced Imagen 3, which is currently in limited beta testing. This new version is expected to offer improved capabilities, though details are not yet fully available. (Imagen)
  • Midjourney
    Midjourney’s Version 6.1, released in late July 2024, brings significant enhancements to AI-driven art generation. The update improves image quality, detail, and stylistic versatility, allowing for the creation of more sophisticated and visually appealing artwork. New features include expanded customization options and faster rendering times. The update also includes improvements to the user interface, making it more accessible for both novice and experienced users. (Midjourney)
  • Titan Image Generator (Amazon)
    Amazon Titan Image Generator was initially launched in preview at the AWS re:Invent 2023 conference, following the announcement of the Titan model family in April 2023. It allowed users to generate images from natural language prompts, edit existing images, and create image variations. In August 2024, Amazon released Titan Image Generator v2, which introduced several enhanced features such as image conditioning, background removal, and subject consistency. Amazon’s Titan Image Generator v2 is gaining popularity among AWS customers, particularly in the advertising and e-commerce sectors, for its ability to create high-quality product images. The increased use of AI image generators could lead to a rise in visually rich and personalized advertising content. Marketers should consider how AI tools like Titan can augment creative workflows and potentially impact the demand for certain creative skills.
  • Flux AI (Black Forest Labs)
    A new state-of-the-art text-to-image model developed by the team behind Stable Diffusion. The company has recently launched Flux AI with significant funding of $31 million, led by Andreessen Horowitz and supported by notable figures like Brendan Iribe and Garry Tan. Flux AI stands out due to its impressive capabilities, including handling complex text, intricate scene compositions, and realistic human anatomy. One of the notable features of Flux AI is its ability to generate highly detailed and realistic images, maintaining high fidelity in prompt adherence and visual quality. It employs a hybrid architecture combining multimodal and parallel diffusion transformer blocks, which enhances its performance and efficiency. (Flux1Pro) 

Generative Audio 

In 2022, the landscape of generative AI in audio took some remarkable strides, particularly in the realm of creating songs. This trend put pressure on music labels to innovate as the industry edges closer to a new era of productivity and creativity. Among these developments, we’ve seen various platforms and tools emerge, pushing the boundaries of how music is generated and customized. Adobe’s Project Music GenAI Control serves as an example, illustrating the potential of AI to convert text descriptions into adaptable tunes. While this innovation from Adobe is significant, it’s part of a broader movement that includes other platforms leveraging AI to create music.

These advancements signal a shift towards more automated and personalized content creation, challenging traditional production methods and offering new opportunities for artists and producers. This period of innovation underscores the growing influence of AI in transforming the creative landscape, marking a step forward in how music is conceived, produced, and experienced. 

ElevenLabs, a pioneer in the generative audio space, offers state-of-the-art AI-powered tools that transform how users create and interact with audio content. Their products include voice synthesis, audio cloning, and real-time translation, allowing users to generate natural, human-like speech in various languages and accents. 

In May 2024 the company launched text-to-sound effects, allowing users to generate sound effects, short instrumental tracks, soundscapes, and a wide variety of character voices all from a text prompt. 

Multimodal AI

Multimodal AI, a significant leap forward in the field, integrates various data inputs to create a more comprehensive understanding of the world for AI systems. This convergence of modalities empowers machines to process and interpret information in a way that mimics human perception. Multimodal AI allows for the input of diverse data types, including text, audio, video, and sensor data, leading to a richer and more nuanced understanding of the world.

Tools like Google’s Gemini and ChatGPT-4o analyze uploaded images and videos and “decipher” the content. This allows people to point their phones at a picture or video capture and gain insights like identified components, potentially revolutionizing image search, accessibility tools, and even creative image editing.

One of the early examples was a conversation between Ethan Mollick of the Wharton School and ChatGPT/Bing, in which Mollick asked questions about a nuclear reactor control table and received hilarious responses from the AI (see example of the chat).

It was at this point, when multimodal interactions with AI became available, that we started to witness the full realization of AI’s emergence, with some observers suggesting that LLMs and diffusion models had developed their own ‘proto-sentience.’

As evidenced by the GPT Vision AI’s vulnerability to optical illusions, many AI researchers scratch their heads as to why a computer vision system that analyzes data at a pixel level would exhibit the same features or bugs specific to the idiosyncratic nature of the human brain.

Text to Film: Runway and Sora

Runway and Sora stand at the forefront of multimodal AI, pushing the boundaries of what AI can create. Runway is a creative platform that enables users to experiment with AI-powered tools for video editing, music generation, and text-to-image generation.

Sora is a cutting-edge diffusion model designed for video generation. It starts with an initial state resembling static noise and progressively refines this into a coherent video. It has the unique ability to generate videos in their entirety or to extend existing ones, ensuring consistent representation of subjects throughout, even when they temporarily exit the scene.

Leveraging a transformer architecture similar to that used in GPT models, Sora achieves enhanced scalability. Videos and images are broken down into patches, similar to tokens in GPT, allowing for training across a vast spectrum of visual data, including various durations, resolutions, and aspect ratios. Drawing upon research from DALL-E and GPT models, particularly employing DALL-E 3’s recaptioning technique, Sora excels in adhering to textual instructions and producing videos that are remarkably aligned with user prompts. It can animate still images or extend and enrich existing videos with unprecedented detail. This model is paving the way for AI’s capability to understand and replicate the complexity of the real world, marking a significant step toward achieving AGI.
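
The patch idea can be sketched as follows. This is a simplified illustration rather than Sora's actual pipeline, which OpenAI describes as working on a compressed representation of the video rather than on raw pixels. The `patchify` function, the toy video, and the patch sizes are assumptions made for the example.

```python
def patchify(video, pt, ph, pw):
    """Split a video[t][y][x] grid into flattened spacetime patches.

    pt, ph, pw are the patch extents in time, height, and width.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Each patch covers a small block of pixels across a few frames.
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches

# Toy "video": 4 frames of 4x4 pixels, each value encoding its own position.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
         for t in range(4)]
patches = patchify(video, pt=2, ph=2, pw=2)
```

Each flattened patch plays the role that a token plays in a language model, so a clip of any duration, resolution, or aspect ratio simply becomes a longer or shorter sequence of patches.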

Sora can create videos of up to a minute in duration with striking fidelity and photorealistic accuracy, demonstrating the profound advances in AI’s ability to generate complex, dynamic visual content.

Text to Video: Diffusion Transformers – OpenAI’s Sora

Sora prompts: Extreme close-up of a 24-year-old woman’s eye blinking, standing in Marrakech during magic hour. Drone view of waves crashing against the rugged cliffs along Big Sur’s Garay Point beach. A litter of golden retriever puppies playing in the snow. The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road. A drone camera circles around a beautiful historic church on a rocky outcropping along the Amalfi coast. Tour of an art gallery with many beautiful works of art in different frames. Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following a couple.

And then there is lip-synching. Early in 2024, Alibaba showcased a new image-to-video algorithm called EMO that could bring to life photographed individuals, animating them into speaking words or singing songs. This presented a new standard in the technology available to create natural cognitive assistants.

Image to Film

Diffusion models can also be used to generate images that are then loaded into text-to-video models to create a film based on the image.

A new feature from Runway Gen-3 also allows users to set the film to end with the last frame matching the uploaded image. See example.

Text-to-Video: The Four Major Players

  • Sora (OpenAI)
    Sora is OpenAI’s text-to-video model, currently in limited beta. It can generate videos up to a minute long from text prompts, maintaining visual quality and adhering to instructions. Sora can create complex scenes with multiple characters and accurate details. It’s being tested by red teamers and select visual artists, designers, and filmmakers for feedback.
  • Runway Gen-3 (Runway)
    Runway introduced Gen-3 on June 17, 2024, and made it publicly available on July 1. The model boasts enhanced video fidelity, resolution, and realism, the ability to generate highly detailed, complex video environments (considered a big leap), and expressive human characters, in clips of up to 18 seconds. It’s generally considered a significant advancement in generative AI technology. It has improved text understanding, allowing it to better interpret and translate text prompts into video content.
  • Veo (Google)
    Veo is Google’s text-to-video AI model, announced at the Google I/O conference in May 2024. It aims to generate high-quality video content from text descriptions, leveraging Google’s expertise in AI and machine learning. However, Veo is not yet publicly available, and specific release plans have not been widely publicized. It remains in development and may be in limited beta testing.
  • Kling AI (Kuaishou)
    Kling AI was released in July 2024, offering cutting-edge text-to-video generation that seems to rival Sora. Interestingly, while Sora was revealed much earlier, Kling is now available to the general public and offers surprisingly powerful capabilities.

Text to Worlds: The Genie of Creation

OpenAI’s unveiling of its advanced generative model Sora, capable of transforming text into video, set a new benchmark in the field of generative AI. In a similar vein, Google DeepMind introduced Genie, a ground-breaking model that turns short descriptions, sketches, or photos into playable video games reminiscent of classic 2D platformers like Super Mario Bros.

Unlike the swift gameplay of contemporary games, Genie’s creations progress at a more measured pace of one frame per second. While in its nascent stage, this technology demonstrates the potential to significantly lower the barriers to video game creation, making it accessible to a wider audience without the need for sophisticated programming skills.

Genie differentiates itself by being trained solely on video footage of various 2D platform games, a departure from previous approaches that required pairing video frames with corresponding input actions, such as button presses. This method allowed for the utilization of vast amounts of video data available online, simplifying the training process and expanding the model’s learning potential.

Genie’s capacity to generate games from simple sketches or images on the fly, adapting the gameplay based on player actions, showcases the advancements in AI’s creative capabilities. Despite its current limitations in frame rate, future iterations of Genie promise improvements in speed and complexity, potentially offering new tools for creativity and game development, as well as applications in robotics and other fields.

This ground-breaking technology holds immense potential for gaming, simulation, and interactive training industries. Users can describe the desired environment, characters, and storyline, and Genie will bring their vision to life, creating an immersive and dynamic world for exploration and interaction. Genie opens doors for innovative and immersive gaming experiences that were previously unimaginable.

As remarkable as these developments are, we have yet to see their full impact on the world around us: new businesses, services, applications, and devices. But it is coming. After the flash of lightning, the boom of thunder is now upon us.

Text to Action: The Next Frontier in AI

The next frontier will be text-to-action. This innovation will combine the existing capabilities – such as text-to-code – with function calling, enabling AI to perform a series of linked tasks like searching the web, sending SMS messages, or making phone calls—all in response to a single user request.

This capability will be realized through the development of agentic frameworks—a series of linked prompts connected to a context window with the ability for function calling. Consequently, a single command will be able to trigger a chain of actions, all guided by the AI’s understanding of the user’s intent and context. Recent advancements, such as OpenAI’s function calling and Google’s Project Astra, have already set the groundwork for this future.
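
A chain of linked function calls of this kind might look like the sketch below. Everything in it is hypothetical: the two tools are stubs, `fake_model` returns a fixed plan where a real system would ask an LLM to produce one, and the `$PREVIOUS` placeholder is an invented convention for passing one step's result to the next. It does not reproduce any vendor's function-calling API.

```python
import json

# Hypothetical tools; real ones would call a search API, an SMS gateway, and so on.
def search_web(query):
    return f"top result for '{query}'"

def send_sms(to, body):
    return f"SMS to {to}: {body}"

TOOLS = {"search_web": search_web, "send_sms": send_sms}

def fake_model(request):
    """Stand-in for an LLM: returns a fixed plan of tool calls as JSON."""
    return json.dumps([
        {"tool": "search_web", "args": {"query": request}},
        {"tool": "send_sms", "args": {"to": "+15550100", "body": "$PREVIOUS"}},
    ])

def run_agent(request):
    """Execute the planned tool calls in order, feeding results forward."""
    context = []   # running record of results, akin to a context window
    for call in json.loads(fake_model(request)):
        args = {
            key: (context[-1] if value == "$PREVIOUS" else value)
            for key, value in call["args"].items()
        }
        context.append(TOOLS[call["tool"]](**args))
    return context

results = run_agent("pizza near me")
```

The design choice worth noting is the registry of named tools: the model only ever emits structured data naming a tool and its arguments, and ordinary code decides whether and how to execute it.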

This development is set to transform the user experience of generative AI as it is integrated into everyday assistants such as Apple Intelligence, Google Assistant, or Amazon’s Alexa. Soon, with just a simple voice or text command, users will have the power to initiate and complete complex actions effortlessly, bringing us closer to a future where our digital assistants are more capable and intuitive than ever before. These advancements are expected to significantly enhance productivity and user experience, marking a major leap forward in AI’s integration into daily life.

This presents significant implications for marketing. As AI automatically decides on the brand of choice and facilitates conversions, it will upend marketing. For some categories—particularly the low-interest or undifferentiated categories—the focus will shift towards influencing the Large Language Models (LLMs) rather than directly targeting the minds of the audience.