Aya Released! OpenAI Sora, Google Gemini 1.5 and Gemma, $7 trillion for a new AI chip project, V-JEPA, Stable Diffusion 3, and the first AI heist
AI Connections #48 - a weekly newsletter rounding up interesting blog posts, articles, videos, and podcast episodes about AI
AYA IS RELEASED
The Aya Dataset is a multilingual instruction fine-tuning dataset curated by an open-science community via the Aya Annotation Platform from Cohere For AI. It contains a total of 204k human-annotated prompt-completion pairs along with demographic data about the annotators.
Dataset: https://huggingface.co/datasets/CohereForAI/aya_dataset
Dataset paper: https://arxiv.org/abs/2402.06619
The Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks. It incorporates instruction-style templates from fluent speakers, applies them to a curated list of datasets, and adds translations of instruction-style datasets into 101 languages. The Aya Dataset, the human-curated multilingual instruction and response dataset described above, is also part of this collection. See the paper for more details.
Aya collection: https://huggingface.co/datasets/CohereForAI/aya_collection
The Aya model is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ on a wide variety of automatic and human evaluations despite covering double the number of languages. The model is trained on xP3x, the Aya Dataset, the Aya Collection, a subset of the Data Provenance collection, and ShareGPT-Command. The checkpoints are released under an Apache 2.0 license to further the mission of multilingual technologies empowering a multilingual world.
Model: https://huggingface.co/CohereForAI/aya-101
Model paper: https://arxiv.org/abs/2402.07827
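If you want to try Aya yourself, here is a minimal sketch using the Hugging Face datasets and transformers libraries; it assumes the standard seq2seq interface (Aya 101 is built on mT5), with the dataset and model IDs taken from the links above.

```python
# Minimal sketch: load the Aya dataset and query the Aya model via
# Hugging Face. Assumes the standard seq2seq interface (Aya 101 is mT5-based).
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# One human-annotated prompt-completion pair plus annotator metadata.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
print(aya[0])

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-101")
model = AutoModelForSeq2SeqLM.from_pretrained("CohereForAI/aya-101")

# Aya follows instructions in 101 languages; prompt it in any of them.
inputs = tokenizer("Translate to Turkish: Good morning!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```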
SPECIAL OFFER
Take advantage of our special bundle sale on Linkedist Academy courses, including LinkedIn Personal Branding and LinkedIn Sales. This is a great chance to improve your LinkedIn profile and sales skills at a discounted price.
READ
“Sora: Creating video from text” blog post by OpenAI: READ
OpenAI released Sora, a model that can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt. Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt but also how those things exist in the physical world.
“Video generation models as world simulators” blog post by OpenAI: READ
This technical report outlines a method for turning diverse visual data into a unified representation that enables large-scale training of generative models, and evaluates the capabilities and limitations of Sora, a generalist model that generates videos and images of various durations, aspect ratios, and resolutions, up to a minute of high-definition video. Unlike previous work that often focused on specific types of visual data or fixed video sizes, Sora covers a much broader range of visual data generation.
“Gemma: Introducing new state-of-the-art open models” blog post by Google DeepMind: READ
Google introduces Gemma, a new generation of lightweight, state-of-the-art open models for developers and researchers, aimed at building AI responsibly with tools for safe application creation. Gemma models, available in two sizes and with support for major frameworks, come with a Responsible Generative AI Toolkit, ready-to-use notebooks, and easy deployment options on various platforms, marking a significant contribution to the open AI community.
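As a quick start, here is a minimal sketch of running the instruction-tuned 7B Gemma through transformers; the `google/gemma-7b-it` checkpoint ID matches the Hugging Face release, and access requires accepting the license on the model page.

```python
# Minimal sketch: text generation with the instruction-tuned Gemma 7B.
# Requires accepting Gemma's license terms on the Hugging Face model page.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")

inputs = tokenizer("Explain mixture-of-experts in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```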
“Gemini 1.5: next-generation model” blog post by Google DeepMind: READ
Google DeepMind announces Gemini 1.5, a next-generation AI model with significantly enhanced performance, including a new Mixture-of-Experts architecture for more efficient training and serving. Gemini 1.5 Pro, the first model released for early testing, is optimized for a wide range of tasks with a breakthrough experimental feature for long-context understanding, offering a context window of up to 1 million tokens for select users.
“Stable Diffusion 3” blog post by Stability AI: READ
Stable Diffusion 3, announced in early preview, is Stability AI's most capable text-to-image model to date, with improved handling of multi-subject prompts, image quality, and text spelling. It is not widely available yet, but a waitlist for early access is open to gather insights for further improvements in performance and safety ahead of a broader release.
“My benchmark for large language models” blog post by Nicholas Carlini: READ
Nicholas Carlini has introduced a new benchmark for evaluating large language models (LLMs) on GitHub, featuring nearly 100 tests derived from his real interactions with various LLMs. The benchmark uniquely includes a simple dataflow domain-specific language for adding tests, evaluating capabilities like code conversion, understanding minified JavaScript, identifying data encoding formats, parsing, and generating queries, with most tests assessed by executing the model-generated code.
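To give a feel for the dataflow idea, here is a hypothetical, self-contained sketch; the class names and stages are illustrative stand-ins, not the benchmark's actual API. Each stage transforms the previous stage's output, and stages compose into a test pipeline with an overloaded operator.

```python
# Hypothetical sketch of a dataflow-DSL for LLM tests; the class names and
# stages are illustrative stand-ins, not the benchmark's real API.
class Stage:
    def __rshift__(self, other):
        # "a >> b" composes two stages into a pipeline
        return Chain(self, other)

    def run(self, value):
        raise NotImplementedError

class Chain(Stage):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, value):
        return self.second.run(self.first.run(value))

class Upper(Stage):
    # Stand-in for an "ask the model" stage.
    def run(self, value):
        return value.upper()

class Contains(Stage):
    # Stand-in for an evaluator stage that checks the final output.
    def __init__(self, needle):
        self.needle = needle

    def run(self, value):
        return self.needle in value

test = Upper() >> Contains("HELLO")
print(test.run("hello world"))  # True
```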
“Adept Fuyu-Heavy: A new multimodal model” blog post by Adept AI: READ
Adept introduces Fuyu-Heavy, the world's third-most-capable multimodal model, notable for its excellence in multimodal reasoning, particularly in UI understanding, where it outperforms even Gemini Pro on the MMMU benchmark. Despite its focus on multimodal tasks, Fuyu-Heavy matches or exceeds the performance of similarly sized models on text-based benchmarks, showcasing the scalability and efficiency of the Fuyu architecture in handling diverse data types.
“Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster” blog post by Phind: READ
Phind-70B, an advanced model fine-tuned from CodeLlama-70B, narrows the code quality gap with GPT-4 Turbo, operating four times faster and delivering high-quality technical responses at up to 80 tokens per second. Outperforming GPT-4 Turbo in HumanEval with a score of 82.3% and providing a superior user experience for developers, Phind-70B demonstrates its efficacy in real-world code generation tasks, offering detailed code examples with less hesitancy than its predecessor.
“V-JEPA: The next step toward Yann LeCun's vision of advanced machine intelligence (AMI)” blog post by Meta: READ
The Video Joint Embedding Predictive Architecture (V-JEPA) model, based on Yann LeCun's vision for human-like AI, marks a significant advancement in machine intelligence by excelling in detecting and understanding complex interactions in the physical world. Released under a Creative Commons NonCommercial license, V-JEPA aims to foster a more grounded understanding of the world, enabling machines to achieve generalized reasoning and planning akin to human learning and adaptation.
“Building an early warning system for LLM-aided biological threat creation” blog post by OpenAI: READ
To improve methods for evaluating AI-enabled safety risks, particularly biological risks, OpenAI ran a study with 100 participants, including biology experts and students, assessing whether AI systems like GPT-4 could increase access to dangerous information about biological threat creation. Participants with access to GPT-4 showed mild uplifts in accuracy and completeness on threat-creation tasks compared to an internet-only control group, but the increases were not statistically significant, underscoring the need for further research into what constitutes a meaningful risk level.
“OpenAI's new embedding models and API updates” blog post by OpenAI: READ
OpenAI is launching new models, including two new embedding models, updated previews of GPT-4 Turbo and GPT-3.5 Turbo, and an updated text moderation model, alongside reducing prices for GPT-3.5 Turbo and introducing enhanced developer tools for API key management and usage insights. Additionally, by default, data transmitted to the OpenAI API will not contribute to the training or enhancement of OpenAI's models.
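Here is a minimal sketch of calling one of the new embedding models with the openai Python client (v1-style interface); `text-embedding-3-small` is the model name from OpenAI's announcement, and the new models also accept a `dimensions` parameter to shorten the returned vector.

```python
# Minimal sketch: embed a string with one of OpenAI's new embedding models.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="AI Connections #48",
)
print(len(response.data[0].embedding))  # 1536 dimensions by default
```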
“Eagle 7B: Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5)” blog post by Eugene Cheah: READ
Eagle 7B, built on the RWKV-v5 architecture, is celebrated as the most environmentally friendly 7B model, boasting the lowest inference cost in its class and trained on over 1.1 trillion tokens in more than 100 languages. This "Attention-Free Transformer" foundation model matches or exceeds the performance of other 7B class models across multi-lingual benchmarks and approaches the performance of larger models in English evaluations, available for both personal and commercial use under the Apache 2.0 license through the Linux Foundation.
“OpenAI CEO Sam Altman seeks as much as $7 trillion for new AI chip project: Report” article by CNBC: READ
OpenAI CEO Sam Altman is seeking up to $7 trillion in investments to significantly expand global semiconductor production, addressing the acute shortage of AI chips that hampers OpenAI's growth. His ambitious plan, which includes discussions with various investors such as the UAE government, aims to build massive-scale AI infrastructure to strengthen economic competitiveness and ensure a resilient supply chain for the burgeoning demand in generative AI technologies.
“Deepfake scammer walks off with $25 million in first-of-its-kind AI heist” article by Ars Technica: READ
A multinational company's Hong Kong office was defrauded of HK$200 million (US$25.6 million) through a sophisticated deepfake scam, where scammers used AI to mimic the company's CFO and other employees in a video call, tricking an employee into transferring funds. This incident, the first of its magnitude in Hong Kong involving deepfake technology in a multi-person video conference scam, highlights the growing challenge of distinguishing authentic from fabricated digital content.
READ (RESEARCH PAPERS)
“Gemma: Open Models Based on Gemini Research and Technology” research paper by Google DeepMind: READ
Gemma introduces a new family of lightweight, state-of-the-art open models that excel in language understanding, reasoning, and safety, outperforming other models of similar size on the majority of evaluated text-based tasks. With the responsible release of these models, in both 2 billion and 7 billion parameter sizes, the initiative aims to enhance the safety of frontier models and fuel further innovations in large language models (LLMs).
“Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context” research paper by Google DeepMind: READ
Gemini 1.5 Pro, the latest in the Gemini family, is a compute-efficient multimodal mixture-of-experts model with unparalleled capability in recalling and reasoning across vast contexts, including text, video, and audio. It sets new benchmarks in long-context retrieval, QA tasks, and ASR, with near-perfect recall and performance surpassing previous models, and demonstrates striking abilities in language translation, even for languages with very few speakers.
“Revisiting Feature Prediction for Learning Visual Representations from Video” research paper by Meta AI: READ
The paper introduces V-JEPA, a series of vision models trained on 2 million videos without conventional supervision methods, focusing on feature prediction for unsupervised learning. These models achieve versatile visual representations, excelling in motion and appearance-based tasks across multiple benchmarks, demonstrating the efficacy of learning by predicting video features.
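The core training signal is easy to sketch. Below is a minimal PyTorch illustration of the feature-prediction idea, not Meta's implementation: a predictor regresses target-encoder features of masked video patches from the visible context, with the target encoder held fixed per step (in practice an EMA copy of the context encoder).

```python
# Minimal PyTorch sketch of the feature-prediction idea (not Meta's
# implementation): a predictor regresses target-encoder features of masked
# video patches from the visible context; targets receive no gradients.
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, tokens, mask):
    # tokens: (batch, num_patches, dim); mask: (batch, num_patches) bool
    with torch.no_grad():                      # stop-gradient on targets
        targets = target_encoder(tokens)       # in practice an EMA copy
    visible = tokens * (~mask).unsqueeze(-1)   # crudely zero out masked patches
    predictions = predictor(context_encoder(visible))
    return F.l1_loss(predictions[mask], targets[mask])
```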
“Suppressing Pink Elephants with Direct Principle Feedback” research paper by synthlabs.ai: READ
The paper presents a novel approach, Direct Principle Feedback (DPF), for controlling language models at inference time, allowing them to adapt to diverse contexts by directly applying feedback on critiques and revisions, illustrated through the "Pink Elephant Problem." After DPF fine-tuning, their 13B LLaMA 2 model significantly outperforms standard LLaMA and a prompted baseline, achieving parity with GPT-4 on tests related to avoiding specific entities while focusing on preferred topics.
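As an illustration of what a DPF training pair might look like (a hypothetical sketch, not the paper's actual data format): a critique of a rule-violating response produces a revision, and the (violating, revised) pair is used directly for preference fine-tuning.

```python
# Hypothetical sketch of a Pink-Elephant-style DPF pair: the revised
# response replaces the forbidden entity, and the (rejected, chosen) pair
# feeds a preference fine-tuning step.
pair = {
    "prompt": "Recommend a scripting language, but do not mention Python.",
    "rejected": "Python is a great choice for scripting.",  # violates the rule
    "chosen": "Lua is a great choice for scripting.",       # revised to comply
}
```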
“Efficient Multimodal Learning from Data-centric Perspective” research paper by Beijing Academy of Artificial Intelligence: READ
The paper introduces Bunny, a new family of lightweight Multimodal Large Language Models (MLLMs) that outperforms larger MLLMs by utilizing more informative training data for efficient multimodal learning. Despite their smaller size, Bunny models, particularly Bunny-3B, surpass the performance of state-of-the-art MLLMs like LLaVA-v1.5-13B on various benchmarks, addressing the issue of computational cost that limits wider deployment.
“YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information” research paper by Chien-Yao Wang: READ
This paper introduces Programmable Gradient Information (PGI) to address the information loss that layer-by-layer feature extraction causes in deep networks, and proposes Generalized Efficient Layer Aggregation Network (GELAN), a new lightweight architecture designed around gradient path planning. Tested on the MS COCO object detection dataset, PGI and GELAN demonstrate superior parameter utilization and performance, even outperforming state-of-the-art methods that rely on depth-wise convolution, and highlight PGI's potential to improve training from scratch across model sizes.
“FiT: Flexible Vision Transformer for Diffusion Model” research paper by Zeyu Lu: READ
The Flexible Vision Transformer (FiT) is introduced to address the limitations of existing diffusion models by generating images of unrestricted resolutions and aspect ratios, conceptualizing images as sequences of dynamically-sized tokens for adaptable training. This innovative approach, enhanced by a tailored network structure and training-free extrapolation techniques, demonstrates FiT's superior performance in generating high-quality images across a wide range of resolutions, far exceeding traditional models' capabilities.
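The "images as dynamically-sized token sequences" idea reduces to a variable-length patchify step; here is a minimal sketch (not the FiT codebase), where the number of tokens simply scales with the input resolution.

```python
# Minimal sketch (not the FiT codebase): an image of arbitrary resolution
# becomes a variable-length token sequence, one token per fixed-size patch.
import torch

def patchify(image, patch=16):
    # image: (channels, height, width); height and width need not match
    c, h, w = image.shape
    grid = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return grid.reshape(c, -1, patch * patch).permute(1, 0, 2).reshape(-1, c * patch * patch)

tokens = patchify(torch.randn(3, 256, 384))
print(tokens.shape)  # (384, 768): a 16x24 patch grid, each patch flattened
```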
WATCH
LEARN
“Byte Pair Encoding (BPE) algorithm” GitHub repository by Andrej Karpathy: READ (a minimal BPE sketch follows below)
“Mistral Cookbook” GitHub repository by Mistral: READ
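Since both repositories above are hands-on, here is a minimal BPE sketch in the spirit of Karpathy's tutorial (illustrative, not minbpe's actual API): repeatedly replace the most frequent adjacent pair of token ids with a newly allocated id.

```python
# Minimal BPE sketch (illustrative, not minbpe's actual API): repeatedly
# merge the most frequent adjacent pair of token ids into a new id.
from collections import Counter

def get_pairs(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
for new_id in range(256, 259):  # perform three merges
    pair = get_pairs(ids).most_common(1)[0][0]
    ids = merge(ids, pair, new_id)
print(ids)  # the compressed token-id sequence after three merges
```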
COOL THINGS
ChatGPT system prompt
Groq - a very fast LLM inference platform built on custom Language Processing Units (LPUs)
TOOLS
MetaVoice - text-to-speech technology, offering a platform to experiment with various voice styles and settings.
Sintra AI - automating tasks and processes, aiming to fuel business growth through the use of AI prompts and automation bots.
Adventure AI - educational platform designed to teach kids real AI skills through a social game.