
I hope you have the appropriate limits in place. I once copied the transcript of a YouTube video (a 5-hour podcast) into the GPT-4 API playground to chat with it, and after 3 chats it had used up all my credits ($5).
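Rough back-of-the-envelope math for why that adds up so fast (my assumptions, not from the thread: ~150 spoken words per minute, ~1.33 tokens per word, and GPT-4 input pricing of $0.03 per 1K tokens at the time):

```python
# Why three turns on a 5-hour transcript can burn ~$5:
words = 5 * 60 * 150              # ~45,000 words in a 5-hour podcast (assumed rate)
tokens = int(words * 1.33)        # ~60,000 tokens of transcript (assumed ratio)
price_per_1k_input = 0.03         # USD, assumed GPT-4 input rate at the time
turns = 3
# Each chat turn re-sends the whole transcript as context:
cost = turns * (tokens / 1000) * price_per_1k_input
print(f"{tokens} tokens, ~${cost:.2f} for {turns} turns")
```

Output tokens and the system prompt only push it higher, so $5 gone in 3 chats is entirely plausible.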


Yeah that's a scary thing! Hopefully the price cap doesn't get hit.

Also, the way it's built, you can chat with a 5-hour video without breaking the bank, because only the relevant chunks of that video are passed as context, based on the question.
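A toy sketch of that retrieval step: score transcript chunks against the question and only send the top matches to the model. Real pipelines use an embedding model; this bag-of-words cosine similarity is just to show the flow (all names here are mine, not from the project):

```python
# Only the chunks most similar to the question get sent as context.
import math
from collections import Counter

def bow(text):
    """Crude stand-in for an embedding: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question, chunks, k=2):
    q = bow(question)
    scored = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)
    return scored[:k]  # only these go into the prompt as context

chunks = [
    "the host introduces the guest and their background",
    "they discuss training costs of large language models",
    "a long tangent about coffee and travel",
]
print(top_chunks("what did they say about model training costs?", chunks, k=1))
```

The cost win is that the model only ever sees a few hundred tokens of transcript per question instead of the whole 5 hours.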


How did you decide the chunking size? I'm working on a similar project and our sweet spot seems to be around 800. But I'd really love to hear what others are doing for RAG chunk sizes.
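For reference, fixed-size chunking with a bit of overlap (so sentences cut at a boundary still appear whole in one chunk) is a common baseline. A minimal sketch, counting words as a stand-in for tokens (real pipelines would count with the model's tokenizer, and 800/100 are illustrative numbers):

```python
# Fixed-size chunks with overlap, so boundary sentences land whole in some chunk.
def chunk(words, size=800, overlap=100):
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

transcript = ["word"] * 2000          # stand-in for a tokenized transcript
chunks = chunk(transcript)
print(len(chunks), [len(c) for c in chunks])
```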


Context size is a huge limitation of current LLM designs, but there are already a few open-source attempts at compressing LLM input/output to reduce costs.
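The crudest version of that idea is just dropping low-information words before sending text in. The actual open-source projects use learned token-importance scoring rather than anything this naive; this stopword filter (my own toy example) only shows the cost lever at work, i.e. fewer tokens in:

```python
# Toy "prompt compression": drop filler words before sending to the model.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to",
             "and", "that", "this", "it", "in", "on", "so", "you", "know"}

def compress(text):
    kept = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

raw = "so you know the model is trained on a lot of the data in the wild"
small = compress(raw)
print(len(raw.split()), "->", len(small.split()))   # word count before -> after
```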


I'm using the tiny Mistral 7B on those 1-hour transcripts internally at my company. I was surprised that even the quantized 7B version easily chomped through my 3090's VRAM: the context takes a lot. I think it goes up to 32K tokens (I go up to 20K). It hallucinates once every few sentences, but that's surprisingly a non-issue for my use cases (mostly automated meeting notes, where I'm going through the material anyway). 60 T/s is also great.
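The context eating VRAM is mostly the KV cache. A back-of-the-envelope sketch, assuming Mistral 7B's published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache; worth checking against the model card):

```python
# KV-cache size: 2 tensors (K and V) per layer, per KV head, per token.
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # assumed Mistral 7B shape
per_token = 2 * layers * kv_heads * head_dim * bytes_per
ctx = 20_000                                            # tokens actually in use
total_gib = per_token * ctx / 2**30
print(f"{per_token} B/token, ~{total_gib:.2f} GiB at {ctx} tokens")
```

Roughly 2.5 GiB of cache at 20K tokens, on top of ~7.5 GB of Q8 weights; that leaves less headroom on a 24 GB 3090 than you'd expect.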

EDIT: of course GPT-4 blows Mistral out of the water on those very specific "needle in a haystack" or "sharp deductive reasoning needed" cases. Sometimes it makes people go wow when I present that.


Which 7B model are you using? 4-bit, I assume?


This specifically is:

TheBloke / Mistral-7B-Instruct-v0.2-GGUF / mistral-7b-instruct-v0.2.Q8_0.gguf

and I'm running it in LMStudio with the config:

  {
    "name": "Exported from LM Studio on 21.12.2023, 14:57:43",
    "load_params": {
    "n_ctx": 32768,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 1,
    "n_gpu_layers": 100,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [
      0
    ],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true
    },
    "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "[INST]",
    "input_suffix": "[/INST]",
    "antiprompt": [
      "[INST]"
    ],
    "pre_prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "pre_prompt_suffix": "",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
    }
  }



