### Goals

- [x] #NitroDigest read and answer in the discussion about OOP: [https://github.com/Frodigo/garage/discussions/64](https://github.com/Frodigo/garage/discussions/64)
- [x] #NitroDigest research chunked prompts: [https://github.com/Frodigo/garage/issues/97](https://github.com/Frodigo/garage/issues/97)
- [x] #Garage add info about the NitroDigest project to the [[Now]] page

### Notes

- Worked on [this issue](https://github.com/Frodigo/garage/issues/97) and tried to create a working prototype with Cursor, but it was not able to handle this #AIwillNotReplaceMeTooFast
  - it insists on using tiktoken to count tokens in texts, but tiktoken is not a good option for Llama models
  - the boilerplate code was OK, but the chunks were not created correctly
  - it tries stupid solutions like adding a 1500-token "safety margin" to be sure that the chunk size will be OK
  - next steps and open questions:
    - is there any library that can help with counting tokens for Llama models?
    - implement the code responsible for creating chunks
- Added diagrams to the issue

### Challenges & solutions

- #NitroDigest the prompt sent to Ollama is often too long, so Ollama truncates it, which hurts summary quality
  - solution: split the text into chunks
  - problem 1: how to count tokens in the text?
    - tiktoken can do this, but it is designed for OpenAI models, not for Ollama
    - solution: for now I added simple code that counts tokens based on the model configuration (see the first sketch at the end of these notes)
  - problem 2: how to split the text into sentences?
    - why? I need to create chunks without splitting the text in the middle of a word, and each chunk needs to make sense on its own
    - solution: NLTK (see the second sketch at the end of these notes)

### Useful snippets & resources

- tiktoken: [https://github.com/openai/tiktoken](https://github.com/openai/tiktoken)
- NLTK:
  - [https://github.com/nltk/nltk](https://github.com/nltk/nltk)
  - [https://www.nltk.org](https://www.nltk.org)
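
A minimal sketch of the "count tokens from the model configuration" idea from problem 1. The `MODEL_CONFIG` mapping, the model names, the ratios, and the 4.0 default are illustrative assumptions, not values from the NitroDigest codebase:

```python
import math

# Hypothetical per-model settings; tune the ratios against a real tokenizer.
MODEL_CONFIG = {
    "llama3": {"chars_per_token": 4.0},
    "mistral": {"chars_per_token": 3.8},
}
DEFAULT_CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, model: str = "llama3") -> int:
    """Rough token count: characters divided by a per-model ratio."""
    ratio = MODEL_CONFIG.get(model, {}).get("chars_per_token", DEFAULT_CHARS_PER_TOKEN)
    # Round up so the estimate errs toward smaller, safer chunks.
    return math.ceil(len(text) / ratio)
```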
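
And a sketch of the NLTK-based solution to problem 2: split the text into sentences, then greedily pack whole sentences into chunks under a token budget, so no chunk cuts a word in half. The `max_tokens` default and the chunking logic are assumptions for illustration:

```python
import math

import nltk

# One-time download of the Punkt sentence tokenizer
# (newer NLTK releases may also need the "punkt_tab" resource).
nltk.download("punkt", quiet=True)


def _rough_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Same heuristic as the sketch above, inlined to keep this self-contained.
    return math.ceil(len(text) / chars_per_token)


def split_into_chunks(text: str, max_tokens: int = 1024) -> list[str]:
    """Greedily pack whole sentences into chunks that fit the token budget."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for sentence in nltk.sent_tokenize(text):
        sentence_tokens = _rough_tokens(sentence)
        # Close the current chunk when the next sentence would not fit.
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += sentence_tokens

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that a single sentence longer than the budget still becomes its own oversized chunk here; handling that case is left open.

Continuation: [[6. 2025-04-23 - split text into chunks part 2]]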