• The Octonaut@mander.xyz
    link
    fedilink
    English
    arrow-up
    24
    ·
    23 hours ago

    the accepted terminology

    No, it isn’t. The OSI specifically requires the training data be available or at very least that the source and fee for the data be given so that a user could get the same copy themselves. Because that’s the purpose of something being “open source”. Open source doesn’t just mean free to download and use.

    https://opensource.org/ai/open-source-ai-definition

    Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

    In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.

    As per their paper, DeepSeek R1 required a very specific training data set because when they tried the same technique with less curated data, they got R"zero’ which basically ran fast and spat out a gibberish salad of English, Chinese and Python.

    People are calling DeepSeek open source purely because they called themselves open source, but they seem to just be another free to download, black-box model. The best comparison is to Meta’s LlaMa, which weirdly nobody has decided is going to up-end the tech industry.

    In reality “open source” is a terrible terminology for what is a very loose fit when basically trying to say that anyone could recreate or modify the model because they have the exact ‘recipe’.