
I hope you have the appropriate limits in place. I once copied the transcript of a YouTube video (a 5-hour podcast) into the GPT-4 API playground to chat with it, and after 3 chats it had used up all my credits ($5).
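Rough back-of-the-envelope math for why that adds up so fast (my assumptions, not from the thread: ~150 spoken words per minute, ~1.33 tokens per word, and GPT-4 input pricing of $0.03 per 1K tokens at the time):

```python
# Why three turns on a 5-hour transcript can burn ~$5:
words = 5 * 60 * 150              # ~45,000 words in a 5-hour podcast (assumed rate)
tokens = int(words * 1.33)        # ~60,000 tokens of transcript (assumed ratio)
price_per_1k_input = 0.03         # USD, assumed GPT-4 input rate at the time
turns = 3
# Each chat turn re-sends the whole transcript as context:
cost = turns * (tokens / 1000) * price_per_1k_input
print(f"{tokens} tokens, ~${cost:.2f} for {turns} turns")
```

Output tokens and the system prompt only push it higher, so $5 gone in 3 chats is entirely plausible.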


Yeah that's a scary thing! Hopefully the price cap doesn't get hit.

Also, the way it's built, you can chat with a 5-hour video without breaking the bank, because only the relevant chunks of that video are passed as context, based on the question.
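A toy sketch of that retrieval step: score transcript chunks against the question and only send the top matches to the model. Real pipelines use an embedding model; this bag-of-words cosine similarity is just to show the flow (all names here are mine, not from the project):

```python
# Only the chunks most similar to the question get sent as context.
import math
from collections import Counter

def bow(text):
    """Crude stand-in for an embedding: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question, chunks, k=2):
    q = bow(question)
    scored = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)
    return scored[:k]  # only these go into the prompt as context

chunks = [
    "the host introduces the guest and their background",
    "they discuss training costs of large language models",
    "a long tangent about coffee and travel",
]
print(top_chunks("what did they say about model training costs?", chunks, k=1))
```

The cost win is that the model only ever sees a few hundred tokens of transcript per question instead of the whole 5 hours.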


How did you decide the chunking size? I'm working on a similar project and our sweet spot seems to be around 800. But I'd really love to hear what others are doing for RAG chunk sizes.
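For reference, fixed-size chunking with a bit of overlap (so sentences cut at a boundary still appear whole in one chunk) is a common baseline. A minimal sketch, counting words as a stand-in for tokens (real pipelines would count with the model's tokenizer, and 800/100 are illustrative numbers):

```python
# Fixed-size chunks with overlap, so boundary sentences land whole in some chunk.
def chunk(words, size=800, overlap=100):
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

transcript = ["word"] * 2000          # stand-in for a tokenized transcript
chunks = chunk(transcript)
print(len(chunks), [len(c) for c in chunks])
```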


Context size is a huge limitation of current LLM designs, but there are already a few open-source attempts at compressing LLM input/output to reduce costs.
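The crudest version of that idea is just dropping low-information words before sending text in. The actual open-source projects use learned token-importance scoring rather than anything this naive; this stopword filter (my own toy example) only shows the cost lever at work, i.e. fewer tokens in:

```python
# Toy "prompt compression": drop filler words before sending to the model.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to",
             "and", "that", "this", "it", "in", "on", "so", "you", "know"}

def compress(text):
    kept = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

raw = "so you know the model is trained on a lot of the data in the wild"
small = compress(raw)
print(len(raw.split()), "->", len(small.split()))   # word count before -> after
```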


I'm using the tiny Mistral 7B on those 1-hour transcripts internally at my company. I was surprised that even the quantized 7B version easily chomped through my 3090's VRAM: the context takes a lot. I think it goes up to 32K tokens (I go up to 20K). It hallucinates once every few sentences, but that's surprisingly a non-issue for my use cases (mostly automated meeting notes, where I'm going through the material anyway). 60 T/s is also great.
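The context eating VRAM is mostly the KV cache. A back-of-the-envelope sketch, assuming Mistral 7B's published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache; worth checking against the model card):

```python
# KV-cache size: 2 tensors (K and V) per layer, per KV head, per token.
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # assumed Mistral 7B shape
per_token = 2 * layers * kv_heads * head_dim * bytes_per
ctx = 20_000                                            # tokens actually in use
total_gib = per_token * ctx / 2**30
print(f"{per_token} B/token, ~{total_gib:.2f} GiB at {ctx} tokens")
```

Roughly 2.5 GiB of cache at 20K tokens, on top of ~7.5 GB of Q8 weights; that leaves less headroom on a 24 GB 3090 than you'd expect.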

EDIT: of course GPT-4 blows Mistral out of the water on those very specific "needle in a haystack" or "sharp deductive reasoning needed" cases. Sometimes it makes people go wow when I present that.


Which 7B model are you using? 4-bit, I assume?


This specifically is:

TheBloke / Mistral-7B-Instruct-v0.2-GGUF / mistral-7b-instruct-v0.2.Q8_0.gguf

and I'm running it in LMStudio with the config:

  {
    "name": "Exported from LM Studio on 21.12.2023, 14:57:43",
    "load_params": {
    "n_ctx": 32768,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 1,
    "n_gpu_layers": 100,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [
      0
    ],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true
    },
    "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "[INST]",
    "input_suffix": "[/INST]",
    "antiprompt": [
      "[INST]"
    ],
    "pre_prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "pre_prompt_suffix": "",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
    }
  }



