Looks like we’re finally reaching the end of the hype phase of AI coding tools. My experience is that reasoning models like R1 are able to produce useful code fairly consistently now, but they’re not a replacement for a human. They’re a tool a human can use to speed up their workflow.
even then i’d be extremely cautious about using ai code bc what’s worse than generating broken code is generating code that looks convincing and runs “properly” but is incorrect. in my experience, this is the worst problem about llms in general: very convincing but wrong answers
The trick is not to generate a lot of code all at once. I tend to use it for writing specific functions or doing boring tasks like creating SQL schemas from JSON input. I can validate the results pretty easily, and it still saves me a lot of time looking stuff up or writing boilerplate.
I also find that some languages work better than others in this context. For example, I primarily work with Clojure, a functional language where data is immutable. The contract for most functions is a plain data structure without any state or mutable references to the outside world. This makes it easy to test functions in isolation to ensure they’re doing what’s intended. For instance, just the other day I had DeepSeek write a function to extract image URLs from Markdown for me:
```clojure
;; assumes [clojure.string :as string] is required in the ns declaration
(defn extract-image-links
  "Extracts all image links including reference-style images"
  [markdown]
  (let [;; Extract reference definitions, e.g. [name]: https://example.com/img.png
        ref-regex   #"(?m)^\[([^\]]+)\]:\s+(https?://\S+)$"
        refs        (into {}
                          (map (fn [[_ ref url]] [ref url])
                               (re-seq ref-regex markdown)))
        ;; Extract all image tags, both ![alt](url) and ![alt][ref]
        image-regex #"!\[([^\]]*)\]\(([^\)]+)\)|!\[([^\]]*)\]\[([^\]]*)\]"
        matches     (re-seq image-regex markdown)
        ;; Resolve each match to a URL, looking reference-style images up in refs
        urls        (keep (fn [[_ _ inline-url _ ref]]
                            (cond
                              inline-url (string/trim inline-url)
                              ref        (get refs (string/trim ref))))
                          matches)]
    (set urls)))
```
The code is fairly idiomatic and easy to follow, and I can easily test this function by passing some Markdown through it and checking that it gives me the expected results. The most annoying part about writing it by hand would’ve been crafting the regexes, which DeepSeek managed to do correctly.
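For instance, a quick check at the REPL along these lines (the sample Markdown and URLs are made up purely for illustration) shows both the inline and the reference-style link coming back:

```clojure
(def sample-markdown
  "![cat](https://example.com/cat.png)

Some text, then a reference-style image: ![dog][dog-pic]

[dog-pic]: https://example.com/dog.jpg")

;; both the inline URL and the resolved reference should be in the result
(= (extract-image-links sample-markdown)
   #{"https://example.com/cat.png"
     "https://example.com/dog.jpg"})
;; => true
```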
i’ll be honest, that regex gives me the chills, but i’m not a fan of regexes in the first place, so maybe that’s on me
anyway, maybe you’re right, but what still gives me pause is verifying the result is correct. maybe the llm is giving me something that works for the happy path but breaks in some corner case for some input i didn’t think of. unit tests only get you so far in these cases
It’s precisely the kind of thing I would not want to figure out by hand, but it’s fairly easy to validate that it does what I want it to. The use case here is well defined, and it’s pretty easy to write tests for the expected behaviors. In general, I like to treat tests as contracts for the functionality of the app.
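To make that concrete, here’s roughly what a contract for the function above could look like, as a minimal clojure.test sketch (the URLs are made up, and the second test covers the corner case of a reference with no matching definition):

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(deftest inline-and-reference-images
  ;; both image syntaxes should resolve to the same set of URLs
  (is (= #{"https://example.com/a.png" "https://example.com/b.jpg"}
         (extract-image-links
          "![a](https://example.com/a.png)
some text ![b][b-ref]

[b-ref]: https://example.com/b.jpg"))))

(deftest undefined-references-and-plain-text
  ;; a reference with no matching definition is dropped rather than returned as nil
  (is (= #{} (extract-image-links "![missing][nowhere]")))
  ;; Markdown with no images at all yields an empty set
  (is (= #{} (extract-image-links "Just some prose, no images."))))

(run-tests)
```

Nothing exhaustive, but it pins down the behaviors I actually rely on, and any corner case someone worries about can be added as another `is`.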
Meanwhile, the same argument would apply to hand-written regexes as well. People write code that works on the happy path but breaks in some corner case all the time. This is a common occurrence without LLMs.
Now for the valley of productivity…
I’d guess we’ve seen 5% or less of the value these tools can bring, even without more breakthroughs in the enabling technologies.
What aspect of it do you think users are deficient in that they’re missing 95% of its potential? Prompt engineering?
The IDE integration is clunky… The languages haven’t started to adapt themselves to be more amenable to AI-driven development… There are certainly interfaces which haven’t been explored or, if they have been explored, haven’t ‘clicked’…
Just some proposals; it’s normal for a _lot_ of value to be discovered after the hype calms down a bit.
My experience is much better: these tools have basically replaced stuff like Stack Overflow for me, and they save me a ton of time writing boilerplate.