Introduction to prompt design

Prompt design is the process of creating prompts that elicit the desired response from language models. Writing well structured prompts is an essential part of ensuring accurate, high quality responses from a language model. This page introduces some basic concepts, strategies, and best practices to get you started in designing prompts.
What is a prompt

A prompt is a natural language request submitted to a language model to receive a response back. Prompts can contain questions, instructions, contextual information, examples, and partial input for the model to complete or continue. After the model receives a prompt, depending on the type of model being used, it can generate text, embeddings, code, images, videos, music, and more.
Prompt content types

Prompts can include one or more of the following types of content:

    Input (required)
    Context (optional)
    Examples (optional)

Input

An input is the text in the prompt that you want the model to provide a response for, and it's a required content type. Inputs can be a question that the model answers (question input), a task the model performs (task input), an entity the model operates on (entity input), or partial input that the model completes or continues (completion input).
Question input

A question input is a question that you ask the model that the model provides an answer to.
Prompt:

What's a good name for a flower shop that specializes in selling bouquets of
dried flowers? Create a list of 10 options with just the names.


Response:

    Here are 10 names for a flower shop specializing in dried flowers:

    1. Everlasting Blooms
    2. Dried & Delightful
    3. The Petal Preserve
    4. Whispers of Wildflowers
    5. Forever Blooms
    6. The Botanical Muse
    7. Timeless Blooms
    8. Dried Flower Co.
    9. Bloom & Co.
    10. The Sun-Kissed Petals


(gemini-1.5-flash)
Task input

A task input is a task that you want the model to perform. For example, you can tell the model to give you ideas or suggestions for something.
Prompt:

Give me a simple list of just the things that I must bring on a camping trip. The list should have 10 items.


Response:

    Here's a list of 10 essential items for a camping trip:

    1. **Tent:** Your shelter from the elements.
    2. **Sleeping Bag:** Stay warm and comfortable.
    3. **Sleeping Pad:** Insulation and cushioning for a good night's sleep.
    4. **Headlamp or Flashlight:** Light for nighttime activities.
    5. **First Aid Kit:** Be prepared for minor injuries.
    6. **Water Bottle/Hydration Pack:** Stay hydrated.
    7. **Food and Snacks:** Fuel for your adventures.
    8. **Cooking Gear:** Camp stove, fuel, utensils, etc.
    9. **Knife and Multitool:** Essential for various tasks.
    10. **Camp Chair:** Enjoy your campfire and meals comfortably.


(gemini-1.5-flash)
Entity input

An entity input is what the model performs an action on, such as classify or summarize. This type of input can benefit from the inclusion of instructions.
Prompt:

Classify the following items as [large, small].
Elephant
Mouse
Snail


Response:

    Here's the classification:

    * **Elephant:** Large
    * **Mouse:** Small
    * **Snail:** Small


(gemini-1.5-flash)
Completion input

A completion input is text that the model is expected to complete or continue.
Prompt:

    Complete the following prompt: Some simple strategies for overcoming writer's block include...


Response:

    Some simple strategies for overcoming writer's block include:

    * **Freewriting:** Write non-stop for a set amount of time (5-10 minutes) without worrying about grammar, spelling, or even making sense. Just keep your pen moving and let your thoughts flow freely. This can help you loosen up and access ideas you might not have considered otherwise.
    * **Brainstorming:** Make a list of everything that comes to mind related to your topic, no matter how silly or unrelated it seems. You can also use mind mapping or other visual brainstorming techniques.
    * **Changing your environment:** Sometimes a change of scenery can be all you need to get the creative juices flowing. Go for a walk, work in a coffee shop, or even just move to a different room in your house.
    * **Reading:** Immerse yourself in the work of other writers, especially those who inspire you. Pay attention to their style, their structure, and their use of language.
    * **Taking a break:** Sometimes the best way to overcome writer's block is to simply step away from the project for a while. Go for a walk, watch a movie, listen to music, or do something else that you enjoy. You might be surprised at how much clearer your head feels when you return to your writing.
    * **Focusing on a single element:** Instead of tackling the entire project, focus on a single sentence, paragraph, or scene. Once you get something down on paper, it can be easier to build from there.
    * **Starting with the ending:** If you're stuck on the beginning, try writing the ending first. This can help you to clarify your overall vision for the project and provide a framework for your ideas.
    * **Talking it out:** Share your ideas and challenges with a friend, a writing group, or a writing coach. Discussing your project with someone else can help you to see it from a new perspective and find solutions to your writing problems.

    Remember, writer's block is a common experience, and it's not a sign that you're a bad writer. Just be patient with yourself, try some of these strategies, and keep writing!


(gemini-1.5-flash)
Context

Context can be one of the following:

    Instructions that specify how the model should behave.
    Information that the model uses or references to generate a response.

Add contextual information in your prompt when you need to give information to the model, or restrict the boundaries of the responses to only what's within the prompt.
Prompt:

Marbles:
Color: red
Number: 12
Color: blue
Number: 28
Color: yellow
Number: 15
Color: green
Number: 17

How many green marbles are there?


Response:

There are 17 green marbles.


(gemini-1.5-flash)
Examples

Examples are input-output pairs that you include in the prompt to give the model an example of an ideal response. Including examples in the prompt is an effective strategy for customizing the response format.
Prompt:

Classify the following.
Options:
- red wine
- white wine

Text: Chardonnay
The answer is: white wine
Text: Cabernet
The answer is: red wine
Text: Moscato
The answer is: white wine

Text: Riesling
The answer is:


Response:

The answer is: white wine


(gemini-1.5-flash)
Next steps

    Now that you have an understanding of prompt design, try writing your own prompts using Google AI Studio.
    For a deeper understanding of prompt design, see the prompt strategies topic.
    To learn about multimodal prompting, see Prompting with media files.

-------------------------------------------------

Long context

Gemini 1.5 Flash comes standard with a 1-million-token context window, and Gemini 1.5 Pro comes with a 2-million-token context window. Historically, large language models (LLMs) were significantly limited by the amount of text (or tokens) that could be passed to the model at one time. The Gemini 1.5 long context window, with near-perfect retrieval (>99%), unlocks many new use cases and developer paradigms.

The code you already use for cases like text generation or multimodal inputs will work out of the box with long context.

Throughout this guide, you briefly explore the basics of the context window, how developers should think about long context, various real world use cases for long context, and ways to optimize the usage of long context.

What is a context window?

The basic way you use the Gemini 1.5 models is by passing information (context) to the model, which will subsequently generate a response. An analogy for the context window is short term memory. There is a limited amount of information that can be stored in someone's short term memory, and the same is true for generative models.

You can read more about how models work under the hood in our generative models guide.

Getting started with long context

Most generative models created in the last few years were only capable of processing 8,000 tokens at a time. Newer models pushed this further by accepting 32,000 tokens or 128,000 tokens. Gemini 1.5 is the first model capable of accepting 1 million tokens, and now 2 million tokens with Gemini 1.5 Pro.

In practice, 1 million tokens would look like:

50,000 lines of code (with the standard 80 characters per line)
All the text messages you have sent in the last 5 years
8 average length English novels
Transcripts of over 200 average length podcast episodes

Even though the models can take in more and more context, much of the conventional wisdom about using large language models assumes this inherent limitation on the model, which as of 2024, is no longer the case.

Some common strategies to handle the limitation of small context windows included:

Arbitrarily dropping old messages / text from the context window as new text comes in
Summarizing previous content and replacing it with the summary when the context window gets close to being full
Using RAG with semantic search to move data out of the context window and into a vector database
Using deterministic or generative filters to remove certain text / characters from prompts to save tokens

While many of these are still relevant in certain cases, the default place to start is now just putting all of the tokens into the context window. Because Gemini 1.5 models were purpose-built with a long context window, they are much more capable of in-context learning. For example, with only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro and Gemini 1.5 Flash are capable of learning to translate from English to Kalamang— a Papuan language with fewer than 200 speakers and therefore almost no online presence—with quality similar to a person who learned from the same materials.

This example underscores how you can start to think about what is possible with long context and the in-context learning capabilities of Gemini 1.5.

Long context use cases

While the standard use case for most generative models is still text input, the Gemini 1.5 model family enables a new paradigm of multimodal use cases. These models can natively understand text, video, audio, and images. They are accompanied by the Gemini API that takes in multimodal file types for convenience.

Long form text

Text has proved to be the layer of intelligence underpinning much of the momentum around LLMs. As mentioned earlier, much of the practical limitation of LLMs was because of not having a large enough context window to do certain tasks. This led to the rapid adoption of retrieval augmented generation (RAG) and other techniques which dynamically provide the model with relevant contextual information. Now, with larger and larger context windows (currently up to 2 million on Gemini 1.5 Pro), there are new techniques becoming available which unlock new use cases.

Some emerging and standard use cases for text based long context include:

Summarizing large corpuses of text
- Previous summarization options with smaller context models would require a sliding window or another technique to keep state of previous sections as new tokens are passed to the model
Question and answering
- Historically this was only possible with RAG given the limited amount of context and models' factual recall being low
Agentic workflows
- Text is the underpinning of how agents keep state of what they have done and what they need to do; not having enough information about the world and the agent's goal is a limitation on the reliability of agents

Many-shot in-context learning is one of the most unique capabilities unlocked by long context models. Research has shown that taking the common "single shot" or "multi-shot" example paradigm, where the model is presented with one or a few examples of a task, and scaling that up to hundreds, thousands, or even hundreds of thousands of examples, can lead to novel model capabilities. This many-shot approach has also been shown to perform similarly to models which were fine-tuned for a specific task. For use cases where a Gemini model's performance is not yet sufficient for a production rollout, you can try the many-shot approach. As you might explore later in the long context optimization section, context caching makes this type of high input token workload much more economically feasible and even lower latency in some cases.

Long form video

Video content's utility has long been constrained by the lack of accessibility of the medium itself. It was hard to skim the content, transcripts often failed to capture the nuance of a video, and most tools don't process image, text, and audio together. With Gemini 1.5, the long-context text capabilities translate to the ability to reason and answer questions about multimodal inputs with sustained performance. Gemini 1.5 Flash, when tested on the needle in a video haystack problem with 1M tokens, obtained >99.8% recall of the video in the context window, and 1.5 Pro reached state of the art performance on the Video-MME benchmark.

Some emerging and standard use cases for video long context include:

Video question and answering
Video memory, as shown with Google's Project Astra
Video captioning
Video recommendation systems, by enriching existing metadata with new multimodal understanding
Video customization, by looking at a corpus of data and associated video metadata and then removing parts of videos that are not relevant to the viewer
Video content moderation
Real-time video processing

When working with videos, it is important to consider how the videos are processed into tokens, which affects billing and usage limits. You can learn more about prompting with video files in the Prompting guide.

Long form audio

The Gemini 1.5 models were the first natively multimodal large language models that could understand audio. Historically, the typical developer workflow would involve stringing together multiple domain specific models, like a speech-to-text model and a text-to-text model, in order to process audio. This led to additional latency required by performing multiple round-trip requests and decreased performance usually attributed to disconnected architectures of the multiple model setup.

On standard audio-haystack evaluations, Gemini 1.5 Pro is able to find the hidden audio in 100% of the tests and Gemini 1.5 Flash is able to find it in 98.7% of the tests. Gemini 1.5 Flash accepts up to 9.5 hours of audio in a single request and Gemini 1.5 Pro can accept up to 19 hours of audio using the 2-million-token context window. Further, on a test set of 15-minute audio clips, Gemini 1.5 Pro archives a word error rate (WER) of ~5.5%, much lower than even specialized speech-to-text models, without the added complexity of extra input segmentation and pre-processing.

Some emerging and standard use cases for audio context include:

Real-time transcription and translation
Podcast / video question and answering
Meeting transcription and summarization
Voice assistants

You can learn more about prompting with audio files in the Prompting guide.

Long context optimizations

The primary optimization when working with long context and the Gemini 1.5 models is to use context caching. Beyond the previous impossibility of processing lots of tokens in a single request, the other main constraint was the cost. If you have a "chat with your data" app where a user uploads 10 PDFs, a video, and some work documents, you would historically have to work with a more complex retrieval augmented generation (RAG) tool / framework in order to process these requests and pay a significant amount for tokens moved into the context window. Now, you can cache the files the user uploads and pay to store them on a per hour basis. The input / output cost per request with Gemini 1.5 Flash for example is ~4x less than the standard input / output cost, so if the user chats with their data enough, it becomes a huge cost saving for you as the developer.

Long context limitations

In various sections of this guide, we talked about how Gemini 1.5 models achieve high performance across various needle-in-a-haystack retrieval evals. These tests consider the most basic setup, where you have a single needle you are looking for. In cases where you might have multiple "needles" or specific pieces of information you are looking for, the model does not perform with the same accuracy. Performance can vary to a wide degree depending on the context. This is important to consider as there is an inherent tradeoff between getting the right information retrieved and cost. You can get ~99% on a single query, but you have to pay the input token cost every time you send that query. So for 100 pieces of information to be retrieved, if you needed 99% performance, you would likely need to send 100 requests. This is a good example of where context caching can significantly reduce the cost associated with using Gemini models while keeping the performance high.

FAQs

Do I lose model performance when I add more tokens to a query?

Generally, if you don't need tokens to be passed to the model, it is best to avoid passing them. However, if you have a large chunk of tokens with some information and want to ask questions about that information, the model is highly capable of extracting that information (up to 99% accuracy in many cases).

How does Gemini 1.5 Pro perform on the standard needle-in-a-haystack test?

Gemini 1.5 Pro achieves 100% recall up to 530k tokens and >99.7% recall up to 1M tokens.

How can I lower my cost with long-context queries?

If you have a similar set of tokens / context that you want to re-use many times, context caching can help reduce the costs associated with asking questions about that information.

How can I get access to the 2-million-token context window?

All developers now have access to the 2-million-token context window with Gemini 1.5 Pro.

Does the context length affect the model latency?

There is some fixed amount of latency in any given request, regardless of the size, but generally longer queries will have higher latency (time to first token).

Do the long context capabilities differ between Gemini 1.5 Flash and Gemini 1.5 Pro?

Yes, some of the numbers were mentioned in different sections of this guide, but generally Gemini 1.5 Pro is more performant on most long context use cases.

-----------------------------------------

About generative models

Generative artificial intelligence (AI) models such as the Gemini family of models are able to create content from varying types of data input, including text, images, and audio. At their most basic level, these models operate like sophisticated autocomplete applications. Given input text ("You can lead a horse to water,"), a generative model can predict that the output text that's statistically likely to follow ("but you can't make it drink"), based on patterns learned from their training data. You can use this basic property of generative models for various applications:

Generate creative writing such as poetry, short stories, metaphors, and blog posts
Convert structured data to freeform text
Extract and summarize information from freeform text
Generate code
Translate between languages

Using generative models to build these features doesn't require any machine learning (ML) expertise. You don't need to collect a large dataset or train a model. All it takes to start your first prototype is to describe what you want the model to do in a few sentences.

Generative models that only handle text input and output are also known as large language models (LLMs). The Gemini family of models can process more than text data, and while they can perform the functions of an LLM, they are more properly known as generative models.

Example applications

The following examples show typical text inputs to, and outputs from, a generative model.

Generate a poem

User input: Write me a four-line poem about puppies and Android phones. Make sure it rhymes.

Model output:

Puppies are cute,
Android phones are neat,
They both bring joy,
And make life complete.

Generate a list

User input: Generate a bulleted list of items I need to pack for a three-day camping trip.

Model output:

*   Tent
*   Sleeping bag
*   Sleeping pad
*   Camp stove
*   Pots and pans
*   Utensils
*   Food
*   Water
*   First-aid kit
*   Sunscreen
*   Insect repellent
*   ...

You can get generative models to produce all sorts of useful behaviors like this, just by crafting the right input text, also called a prompt. The art and science of figuring out the right wording to get generative models to do what you want is called prompt design (also called "prompt engineering" or simply "prompting").

Prompt design 101

The previous section showed some examples of prompts that contain an instruction, like 'Write me a poem'. This kind of instruction may work well for certain types of tasks. However, for other applications, another prompting technique called few-shot prompting may work better. Few-shot prompts take advantage of the fact that large language models are incredibly good at recognizing and replicating patterns in text data. The idea is to send the generative model a text pattern that it learns to complete. For example, say you want to build an application that takes as input a country name and outputs its capital city. Here's a text prompt designed to do just that:

Italy : Rome
France : Paris
Germany :

In this prompt, you establish a pattern: [country] : [capital]. If you send this prompt to a large language model, it will autocomplete the pattern and return something like this:

     Berlin
Turkey : Ankara
Greece : Athens

This model response may look a little strange. The model returned not only the capital of Germany (the last country in your hand-written prompt), but also a whole list of additional country and capital pairs. That's because the generative model is "continuing the pattern." If all you're trying to do is build a function that tells you the capital of an input country ("Germany : Berlin"), you probably don't really care about any of the text the model generates after "Berlin." Indeed, as application designers, you'd probably want to truncate those extraneous examples. What's more, you'd probably want to parameterize the input, so that Germany is not a fixed string but a variable that the end user provides:

Italy : Rome
France : Paris
<user input here> :

You have just written a few-shot prompt for generating country capitals.

You can accomplish a large number of tasks by following this few-shot prompt template. Here's a few-shot prompt with a slightly different format that converts Python to JavaScript:

Convert Python to JavaScript.
Python: print("hello world")
JavaScript: console.log("hello world")
Python: for x in range(0, 100):
JavaScript: for(var i = 0; i < 100; i++) {
Python: ${USER INPUT HERE}
JavaScript:

Or, take this "reverse dictionary" prompt. Given a definition, it returns the word that fits that definition:

Given a definition, return the word it defines.
Definition: When you're happy that other people are also sad.
Word: schadenfreude
Definition: existing purely in the mind, but not in physical reality
Word: abstract
Definition: ${USER INPUT HERE}
Word:

You might have noticed that the exact pattern of these few-shot prompts varies slightly. In addition to containing examples, providing instructions in your prompts is an additional strategy to consider when writing your own prompts, as it helps to communicate your intent to the model.

Prompting versus traditional software development

Unlike traditional software that's designed to a carefully written spec, the behavior of generative models is largely opaque even to the model trainers. As a result, you often can't predict in advance what types of prompt structures will work best for a particular model. What's more, the behavior of a generative model is determined in large part by its training data, and since models are continually tuned on new datasets, sometimes the model changes enough that it inadvertently changes which prompt structures work best. What does this mean for you? Experiment! Try different prompt formats.

Model parameters

Every prompt you send to the model includes parameter values that control how the model generates a response. The model can generate different results for different parameter values. The most common model parameters are:

Max output tokens: Specifies the maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.
Temperature: The temperature controls the degree of randomness in token selection. The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Lower temperatures are good for prompts that require a more deterministic or less open-ended response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic, meaning that the highest probability response is always selected.
topK: The topK parameter changes how the model selects tokens for output. A topK of 1 means the selected token is the most probable among all the tokens in the model's vocabulary (also called greedy decoding), while a topK of 3 means that the next token is selected from among the 3 most probable using the temperature. For each token selection step, the topK tokens with the highest probabilities are sampled. Tokens are then further filtered based on topP with the final token selected using temperature sampling.
topP: The topP parameter changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the topP value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the topP value is 0.5, then the model will select either A or B as the next token by using the temperature and exclude C as a candidate. The default topP value is 0.95.
stop_sequences: Set a stop sequence to tell the model to stop generating content. A stop sequence can be any sequence of characters. Try to avoid using a sequence of characters that may appear in the generated content.

Types of prompts

Depending on the level of contextual information contained in them, prompts are broadly classified into three types.

Zero-shot prompts

These prompts don't contain examples for the model to replicate. Zero-shot prompts essentially show the model's ability to complete the prompt without any additional examples or information. It means the model has to rely on its pre-existing knowledge to generate a plausible answer.

Some commonly used zero-shot prompt patterns are:

Instruction-content

<Overall instruction>
<Content to operate on>

For example,

Summarize the following into two sentences at the third-grade level:

Hummingbirds are the smallest birds in the world, and they are also one of the
most fascinating. They are found in North and South America, and they are known
for their long, thin beaks and their ability to fly at high speeds.

Hummingbirds are made up of three main parts: the head, the body, and the tail.
The head is small and round, and it contains the eyes, the beak, and the brain.
The body is long and slender, and it contains the wings, the legs, and the
heart. The tail is long and forked, and it helps the hummingbird to balance
while it is flying.

Hummingbirds are also known for their coloration. They come in a variety of
colors, including green, blue, red, and purple. Some hummingbirds are even able
to change their color!

Hummingbirds are very active creatures. They spend most of their time flying,
and they are also very good at hovering. Hummingbirds need to eat a lot of food
in order to maintain their energy, and they often visit flowers to drink nectar.

Hummingbirds are amazing creatures. They are small, but they are also very
powerful. They are beautiful, and they are very important to the ecosystem.

Instruction-content-instruction

<Overall instruction or context setting>
<Content to operate on>
<Final instruction>

For example,

Here is some text I'd like you to summarize:

Hummingbirds are the smallest birds in the world, and they are also one of the
most fascinating. They are found in North and South America, and they are known
for their long, thin beaks and their ability to fly at high speeds. Hummingbirds
are made up of three main parts: the head, the body, and the tail. The head is
small and round, and it contains the eyes, the beak, and the brain. The body is
long and slender, and it contains the wings, the legs, and the heart. The tail
is long and forked, and it helps the hummingbird to balance while it is flying.
Hummingbirds are also known for their coloration. They come in a variety of
colors, including green, blue, red, and purple. Some hummingbirds are even able
to change their color! Hummingbirds are very active creatures. They spend most
of their time flying, and they are also very good at hovering. Hummingbirds need
to eat a lot of food in order to maintain their energy, and they often visit
flowers to drink nectar. Hummingbirds are amazing creatures. They are small, but
they are also very powerful. They are beautiful, and they are very important to
the ecosystem.

Summarize it in two sentences at the third-grade reading level.

Continuation. Sometimes, you can have the model continue text without any instructions. For example, here is a zero-shot prompt where the model is intended to continue the input provided:

Once upon a time, there was a little sparrow building a nest in a farmer's
barn. This sparrow

Use zero-shot prompts to generate creative text formats, such as poems, code, scripts, musical pieces, email, or letters.

One-shot prompts

These prompts provide the model with a single example to replicate and continue the pattern. This allows for the generation of predictable responses from the model.

For example, you can generate food pairings like:

Food: Apple
Pairs with: Cheese
Food: Pear
Pairs with:

Few-shot prompts

These prompts provide the model with multiple examples to replicate. Use few-shot prompts to complete complicated tasks, such as synthesizing data based on a pattern.

An example prompt may be:

Generate a grocery shopping list for a week for one person. Use the JSON format
given below.
{"item": "eggs", "quantity": "6"}
{"item": "bread", "quantity": "one loaf"}

Generative models under the hood

This section aims to answer the question - Is there randomness in generative models' responses, or are they deterministic?

The short answer - yes to both. When you prompt a generative model, a text response is generated in two stages. In the first stage, the generative model processes the input prompt and generates a probability distribution over possible tokens (words) that are likely to come next. For example, if you prompt with the input text "The dog jumped over the ... ", the generative model will produce an array of probable next words:

[("fence", 0.77), ("ledge", 0.12), ("blanket", 0.03), ...]

This process is deterministic; a generative model will produce this same distribution every time it's input the same prompt text.

In the second stage, the generative model converts these distributions into actual text responses through one of several decoding strategies. A simple decoding strategy might select the most likely token at every timestep. This process would always be deterministic. However, you could instead choose to generate a response by randomly sampling over the distribution returned by the model. This process would be stochastic (random). Control the degree of randomness allowed in this decoding process by setting the temperature. A temperature of 0 means only the most likely tokens are selected, and there's no randomness. Conversely, a high temperature injects a high degree of randomness into the tokens selected by the model, leading to more unexpected, surprising model responses.

Salt Shaker Press

Search This Blog

Friday, September 27, 2024

Prompt design