
Chat: refactor Ollama chat client #3881

Draft: wants to merge 10 commits into main
Conversation

abeatrix (Contributor) commented:

  1. This PR fixes the issue of using PromptString as the prompt text in the messages we send to the LLM, which caused a regression in chat responses from some Ollama models.

  2. Updates the chat clients for Ollama and Groq with better error handling:
    [screenshot]

  3. Logs usage on completion for easier debugging (see the sketch below the screenshots).

Ollama:
[screenshot]

Groq:
[screenshot]
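For illustration only, a minimal sketch of what point 3 could look like, assuming the usage fields Ollama reports in its final streamed object (prompt_eval_count, eval_count, total_duration); the function name and the logger call are invented, not the PR's code:

```ts
// Sketch only, not the PR's code. Ollama's final streamed object (done: true)
// reports usage stats that can be logged for debugging.
interface OllamaUsageStats {
    done: boolean
    prompt_eval_count?: number // tokens in the prompt
    eval_count?: number // tokens generated
    total_duration?: number // total time, in nanoseconds
}

function logUsageOnComplete(packet: OllamaUsageStats): void {
    if (!packet.done) {
        return
    }
    // The real client presumably routes this through its own logger.
    console.debug('Ollama stream completed', {
        promptTokens: packet.prompt_eval_count,
        completionTokens: packet.eval_count,
        durationMs: (packet.total_duration ?? 0) / 1e6,
    })
}
```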

Test plan

  1. Run this branch in debug mode and follow our Ollama docs to set up Ollama with Cody
  2. Try asking Cody a question

Before

[screenshot]

After

[screenshot]

abeatrix requested review from dominiccooney and a team on April 20, 2024 03:04
@dominiccooney (Contributor) left a comment:

Could you tell me more about the specific reason PromptString was regressing Ollama prompts?

lib/shared/src/llm-providers/ollama/chat-client.ts (review thread: outdated, resolved)
lib/shared/src/llm-providers/ollama/chat-client.ts (review thread: outdated, resolved)
lib/shared/src/llm-providers/ollama/utils.ts (review thread: outdated, resolved)
@dominiccooney (Contributor) left a comment:

I believe there are still problems with Ollama response decoding, and this PR is on the right track. It just needs to go further in handling the incremental responses.

@dominiccooney (Contributor) left a comment:

Specific feedback inline.

    try {
        const { done, value } = await reader.read()
        if (typeof value === 'string') {
            const parsedData = JSON.parse(value) as OllamaGenerateResponse
@dominiccooney (Contributor) commented on the quoted lines:

This could be fragile. We shouldn't assume that the network response is chunked on JSON boundaries, either. We should split on newlines and then decode. The protocol is a bit like Server-Sent Events: "packets" are terminated by newlines.
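A minimal sketch of that approach (illustrative; the helper name is invented and this is not the PR's code): keep a rolling text buffer across reads and only parse complete, newline-terminated packets.

```ts
// A network read may end mid-JSON-object, so buffer the raw text and parse
// only the complete lines; the trailing partial line waits for the next read.
let buffer = ''

function onChunk(chunk: string, onPacket: (packet: unknown) => void): void {
    buffer += chunk
    const lines = buffer.split('\n')
    // The last element may be an incomplete packet; keep it buffered.
    buffer = lines.pop() ?? ''
    for (const line of lines) {
        if (line.trim()) {
            onPacket(JSON.parse(line))
        }
    }
}
```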

@abeatrix (Contributor, Author) replied:

@dominiccooney hey Dom, sorry I haven't got a chance to work on this since I'm still working on my sprint work, but I will try to pick this back up tomorrow. Will mark it as a draft until this is ready again. Appreciate your feedback!

abeatrix marked this pull request as draft on April 26, 2024 04:55
abeatrix changed the title from "Chat: Fix prompt issue in Ollama chat client" to "Chat: refactor Ollama chat client" on Apr 30, 2024
@abeatrix (Contributor, Author) commented:

@dominiccooney I've applied your feedback to the PR; the rough shape is sketched below the list:

  • Use a TextDecoderStream
  • Accumulate the decoded text into a string
  • Break on newlines
  • JSON.parse each chunk and act on it
  • Don't assume that, having seen the end of the network response, you have decoded all the characters in the response yet; the last packet may contain data.
    • Process if (value) first, so any decoded value is handled before checking if (done).
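Putting those points together, the read loop looks roughly like the sketch below. RESPONSE_SEPARATOR and OllamaGenerateResponse are names from the diff; the separator value, the minimal interface, and the function name are assumptions for illustration, not the PR's actual code.

```ts
// Minimal stand-ins for the PR's definitions (assumed, not copied from the diff).
const RESPONSE_SEPARATOR = /\r?\n/
interface OllamaGenerateResponse {
    response?: string
    done?: boolean
}

async function readOllamaChatStream(
    body: ReadableStream<Uint8Array>,
    onPacket: (packet: OllamaGenerateResponse) => void
): Promise<void> {
    // Decode bytes to text with a TextDecoderStream and read incrementally.
    const reader = body.pipeThrough(new TextDecoderStream()).getReader()
    let buffer = ''
    while (true) {
        const { done, value } = await reader.read()
        // Process value before checking done; the last packet may still contain data.
        if (value) {
            buffer += value
            const chunks = buffer.split(RESPONSE_SEPARATOR)
            // A trailing partial line stays buffered for the next iteration.
            buffer = chunks.pop() ?? ''
            for (const chunk of chunks.filter(Boolean)) {
                onPacket(JSON.parse(chunk) as OllamaGenerateResponse)
            }
        }
        if (done) {
            // Whatever is still buffered should be the last complete packet.
            if (buffer.trim()) {
                onPacket(JSON.parse(buffer) as OllamaGenerateResponse)
            }
            break
        }
    }
}
```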

abeatrix marked this pull request as ready for review on April 30, 2024 00:30
    // Splits the decoded chunk by the new lines and filters out empty strings.
    if (value) {
        for (const chunk of value.split(RESPONSE_SEPARATOR).filter(Boolean)) {
            const line = JSON.parse(chunk) as OllamaGenerateResponse
valerybugakov (Member) commented on the quoted lines:

Similar to the completions client, we should buffer the chunks in case they are not cut on JSON boundaries.

What do you think about reusing the response streaming/reading part from the completions client to keep this logic in one place?
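For illustration, one shape such a shared helper could take (all names here are hypothetical, not the completions client's actual API): an async generator over newline-delimited JSON packets that both the chat and completions clients iterate.

```ts
// Hypothetical shared helper: buffers decoded text and yields one parsed,
// newline-terminated JSON packet at a time.
async function* readNDJSONStream<T>(body: ReadableStream<Uint8Array>): AsyncGenerator<T> {
    const reader = body.pipeThrough(new TextDecoderStream()).getReader()
    let buffer = ''
    while (true) {
        const { done, value } = await reader.read()
        if (value) {
            buffer += value
            const lines = buffer.split('\n')
            buffer = lines.pop() ?? '' // a partial trailing line stays buffered
            for (const line of lines) {
                if (line.trim()) {
                    yield JSON.parse(line) as T
                }
            }
        }
        if (done) {
            if (buffer.trim()) {
                yield JSON.parse(buffer) as T
            }
            return
        }
    }
}

// Either client could then consume the stream in one place, e.g.:
// for await (const packet of readNDJSONStream<OllamaGenerateResponse>(response.body!)) { ... }
```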

@abeatrix (Contributor, Author) replied:

@valerybugakov yea that'd be great! I will update this PR to reuse the one you created!

abeatrix marked this pull request as draft on May 20, 2024 18:37