Claude 3.7 SUPER CODER… With One Big Flaw?!

26 February 2025


Claude 3.7 is here! It may be an excellent coder but it sure is lacking in other departments…
Have ideas for my new testing rubric? Let me know in the comments below!

Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@matthew_berman
👉🏻 Twitter: https://twitter.com/matthewberman
👉🏻 Discord: https://discord.gg/xxysSXBxFW
👉🏻 Patreon: https://patreon.com/MatthewBerman
👉🏻 Instagram: https://www.instagram.com/matthewberman_ai
👉🏻 Threads: https://www.threads.net/@matthewberman_ai
👉🏻 LinkedIn: https://www.linkedin.com/company/forward-future-ai

Media/Sponsorship Inquiries ✅
https://bit.ly/44TC45V

Links:
Article: https://www.anthropic.com/news/claude-3-7-sonnet
Claude Code: https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview

Claude 3.7 Sonnet was just released, and I just got done testing it. I built a complex snake game that allows two AI snakes to battle each other, added a superfood that creates a block that can destroy one of the snakes and that actually moves around and follows the snake, and all of this was done on the very first try. I'm going to show you more about that later in the video, but first let me tell you a little bit about Claude 3.7 Sonnet.

Two things were actually released: Claude 3.7 Sonnet, a big but still a dot upgrade to the Claude series of models, and Claude Code, a command line interface for agentic coding. Claude 3.7 Sonnet is a thinking model, the first thinking model from Anthropic. I'm pretty surprised this isn't Claude 4, and I find it a little weird that the jump is from 3.5 to 3.7 rather than straight to 4, which makes me think 4 is in the works and is going to be much, much better, though we don't know that for sure. What we do know is that this minor version increment is a big jump: this is the first hybrid reasoning model on the market. That means Claude 3.7 can generate near-instant replies to whatever prompt you have, in the more traditional LLM way, and it also has thinking, so it can take its time using chain of thought before replying to you, very similar to o1, o3, and Grok 3, except here both modes come from a single model.

Just like other thinking models, Claude 3.7 has a scratchpad in which it does its chain of thought: it iterates on its thinking, reflects, tries different potential results, and then finally summarizes everything, or picks the best attempt, and shows it to you. They actually do show the chain of thought, which I found surprising because Anthropic is known for being fairly closed and very big on safety. Whether or not they're showing the true, full chain of thought I'm not sure, but it does look like they are.

If you have API access, there's a dial with which you can tell Claude 3.7 how long to think: you can specify the number of thinking tokens, up to the 128,000-token maximum, which is definitely on the smaller side for context windows. So if you're building applications on the API and using Claude 3.7 Sonnet to power them, you do want to specify a maximum number of tokens so you don't blow your budget overnight.
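As a rough illustration of that thinking dial, here's a minimal sketch of a call that sets an explicit thinking budget with Anthropic's Python SDK. The parameter names follow the extended-thinking API as documented at launch; the model ID, token numbers, and prompt are placeholder assumptions, so treat this as a sketch rather than a drop-in snippet.

```python
# Minimal sketch: capping Claude 3.7 Sonnet's thinking with a token budget.
# Assumes `pip install anthropic` and an ANTHROPIC_API_KEY in the environment;
# the model ID and token numbers below are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,  # hard cap on total output (thinking plus visible answer)
    thinking={
        "type": "enabled",
        "budget_tokens": 2048,  # cap on chain-of-thought tokens; must stay below max_tokens
    },
    messages=[{"role": "user", "content": "Evaluate the integral of x^2 * ln(x) from 0 to 1."}],
)

# The reply interleaves "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print(block.text)
```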
Let's look at some of the results. This is SWE-bench Verified: Claude 3.7 Sonnet is roughly a 20-point jump over the other models listed here, which are Claude 3.5 Sonnet (new), o1, o3-mini (high), and DeepSeek R1. All four of those come in right around 49%, and with Claude 3.7 Sonnet we reach 70%. There's a caveat, though: the lighter pink area is labeled "with custom scaffolding," which just means they used customized chain-of-thought techniques and wrappers around the model to optimize for it. Without the custom scaffolding we still get a nice 12-plus-point increase in performance, but with it we reach 70%.

It's also really good at agentic tool use, and that's what TAU-bench shows. The retail and airline tracks are both real-world tasks where an agent has to go interact with an API, like a retail API or an airline API, and Claude 3.7 Sonnet beats 3.5 and o1 in both, so right now Claude 3.7 is state of the art there. On some of the more traditional benchmarks, all of which are very hard (GPQA Diamond, multilingual Q&A, visual reasoning, MATH 500, AIME 2024), Claude 3.7 with extended thinking is very competitive with the top models out there, which include Grok 3 beta and o3-mini with high reasoning.

These thinking models ace my rubric, and it's time to officially retire it. It's been a fun run, but it's retired, and we get to create a new one. Alex and I are in the process of building that new rubric, but in the meantime I'm going to try out a few new tests in this video to really push Claude 3.7 to the limit. If you have any great suggestions for tests I should put in the new rubric, let me know in the comments below.

This is Claude Code's research preview. It's actually really easy to install; I'll link to the installation instructions down below, and it's like three steps. I'm going to be honest: I tested Grok 3 in the middle of constructing this new rubric and didn't push it as hard as I could have, and I know a lot of you mentioned that in the comments, so in this video I'll be comparing some of these tests against Grok 3 and o3-mini to see how they stack up.

Obviously Claude 3.7 can easily create the snake game. Here it is: it took just a number of seconds, it was very fast, and it works perfectly well. But that's not all; we're going to evolve it. First, let's have an AI control the snake itself and see how easy that is to add: "Make an AI that controls the snake." The one thing I don't like is that you can't see what's happening while it's thinking or writing code; only at the very end, when you get the output, do you see the changes. And speaking of which, here we go: all of the code was written and we have snake_ai.py. Scroll to the bottom: do you want to create the game? Yes, go ahead. Now it's adding all of those changes to my codebase and it should be good to go. Yep, here we go: we can toggle the AI on or off and increase or decrease the speed. Let's give it a try. There it is, the AI is now controlling it; see, I'm not doing anything. Pretty darn good, and it says it's using the A* algorithm to find the next piece of food. There, it made a mistake, game over. Let's keep adding to it: "Now add a second snake to the game, also controlled by AI." There it is, two snakes going at each other; snake two wins. Let's try it again. I can already think of a few improvements to make here.

Next: "Add multiple pieces of food at a time, and add an occasional superfood that allows the snake that eats it to build a temporary 4×4 block that kills the other snake if it runs into it, but not the snake that created it. The superfood block should move slowly around the field for 7 seconds." There we go, there's the superfood. Oh, that's so cool; that worked really well. Let's play one more time: the superfood block is moving around, the two snakes are still finding their own food, and there we go, snake 2 wins. That was very impressive.
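Since the generated snake AI reportedly uses A* to path toward the next piece of food, here's a small, self-contained sketch of what that search looks like on a grid. This is my own illustrative version, not the code Claude produced; the grid representation, function names, and Manhattan-distance heuristic are all assumptions.

```python
# Illustrative A* pathfinding on a snake-style grid (not Claude's generated code).
# Assumes a rectangular grid, 4-directional movement, and that the snake's body
# cells are impassable; the heuristic is Manhattan distance to the food.
import heapq

def a_star(start, goal, blocked, width, height):
    """Return a list of cells from start to goal, or None if no path exists."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # entries are (f = g + h, g, cell)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:   # walk the chain back to the start
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + dx, current[1] + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue                  # off the board
            if nxt in blocked:
                continue                  # snake body or a superfood block
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                came_from[nxt] = current
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None                           # food unreachable

# Example: a 10x10 board with a short obstacle in the way.
print(a_star((0, 0), (7, 5), blocked={(3, 0), (3, 1), (3, 2)}, width=10, height=10))
```

Passing the snake's own body cells in `blocked` is what keeps it from boxing itself in, which a naive turn-toward-the-food strategy misses.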
Now that we've seen what the coder can do, let's move on to Claude 3.7 Sonnet itself and start with a really hard math problem. Let's see if Claude 3.7 can do it; it's just super impressive that it can handle all this notation so easily. Interestingly enough, Grok 3, which came up with this question, gave me -1/27, and Claude 3.7 Sonnet gave me -1/9 for the integral, so I was confused about which one was right. I checked with o3-mini, and it also says -1/9, so I assume Claude was right in this instance. Here's the thing: you do need a paid account to use extended thinking mode, and the previous math problem wasn't even in extended thinking mode and it still got it.

Now let's give Claude 3.7 with extended thinking the Basel problem. This is something I don't know how to solve myself, so I'm just going to rely on finding the answer online. We can actually see the thinking, and one thing that's immediately apparent is that it's pretty fast, though I'd say not quite as fast as Grok 3. There's the answer, and interestingly it notes that the result was first proven in 1735 (by Euler): the sum equals π²/6, which is the correct answer. Looking at the solution, it seems like it already knew the answer, so I asked it to walk me through it step by step. Now we're seeing the step-by-step of how it gets to the answer, rather than it just recalling it because it's a famous problem, and it writes out exactly how it got there. Very, very impressive.

Let's see if it has live information from the web, because I don't know if it does. It's not mentioned anywhere, so I have to assume it doesn't, and that would be a big drawback of this model. Apple just announced that they're investing $500 billion in AI infrastructure; let's see if it knows about it. Look at that: huge, huge drawback, a knowledge cutoff of October 2024. Web access seems like table stakes at this point; it needs it, and I hope it gets it soon.

So that's it: a really good model with some drawbacks, but if you're going to be using it for coding, I think you're going to be happy. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.
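For reference, both results mentioned above can be checked by hand. Here's a short worked version of the integral (integration by parts) and the statement of the Basel problem; the steps are my own, not taken from any model's output.

```latex
% The integral from the video, via integration by parts with u = \ln x, dv = x^2\,dx:
\int_0^1 x^2 \ln x \, dx
  = \left[\frac{x^3}{3}\ln x\right]_0^1 - \int_0^1 \frac{x^3}{3}\cdot\frac{1}{x}\,dx
  = 0 - \frac{1}{3}\int_0^1 x^2\,dx
  = -\frac{1}{9}.

% The Basel problem, first solved by Euler (published 1735):
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.
```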



48 Comments
  1. I'd like to see a creative writing benchmark – not a simple "tell me a story" prompt because there's no objective test for that. Instead, pick a very specific chapter from a classic novel (the great gatsby/pride and prejudice, etc) – and create a standard summary. prompt the ai to rewrite the chapter based on the summary. You can check the quality by feeding both the original and the ai written version back to the ai and getting it to grade itself on whether it hit the plot points, on writing style, on nuance and on depth of understanding. That – using a long enough chapter (3-4000 words) should give you a good idea of writing quality, but also of ability to follow a complex prompt, and play complex cadences against each other.

  2. Tried it myself and yeah, it's hands-down the best AI coder out there for sure, able to think of a new feature and add it in right away, mostly on the first try. The chat usage rate kinda sucks for Pro users still though; wish it was more chats or cheaper.

  3. 2:45 that's 40 percent increase, not 20

  4. You can enable verbose mode: Help, then Configuration.

  5. I believe that no access to the internet is a design choice by Anthropic. It's because they don't want to expose Claude to unvetted information. I'm not suggesting that's the right thing to do though.

  6. Even Hiper Calc (which is a calculator app on Android) can solve both the integral and the infinite sum. Shouldn't be too taxing for an LLM. (Admittedly Hiper Calc is a very good calculator app.)

  7. Matthew, I can suggest a problem for your new rubric. None of the LLMs I've tried have gotten this one. It's a math question that a talented high school math student could understand, but probably not solve. It's also a somewhat unconventional question, so it's outside most training data (though I did tell ChatGPT the answer, so it might figure it out once enough updates come around). I'll give the question, and then explain it: find a bijection between |x| < 2 and |x| > 2. If you remember your high school math, |x| < 2 is the set of numbers on the number line from -2 to 2, not including those endpoints. And |x| > 2 is all numbers strictly less than -2 union all numbers strictly greater than 2. A bijection is a one-to-one correspondence, so find a way of mapping each point of |x| < 2 to exactly one point of |x| > 2 such that all of |x| > 2 is covered. I can give you a breakdown of the answer should you like; just reply on this thread with some contact info.

  8. lmao a "really hard math problem" – basic calculus

  9. AI Test:

    1. Solve the Riemann hypothesis
    2. Generate python code for freecad to create an airplane
    3. Create all case files for conjugate heat transfer simulation in openfoam
    4. Master an mp3 music file
    5. Generate SVG code for a scanning electron microscope

  10. You keep saying it's impressive when the snake it created keeps killing itself. It doesn't check for where its body is, it just rotates to face the next food.

  11. it's really crazy how nobody is talking about the book bevelorus the hidden codex of the financial alchemists

  12. I spent way too much on grok and it sucks.

  13. FYI, those stars are anuses.

  14. Two big flaws. The other, as always, is the usage cap.

  15. I mean, for LLMs, coding IS THE USE CASE. This is the most important thing I wanna see being focused, and the thing that is actually going to result in more and more exponentials. Taking this into account, it's Claude all the way for me.

  16. Here's a prompt for the rubric. I've been trying to get models to do this, and most can't…

    please give me at least 10 stocks that are optionable, and also have an ex-dividend date occurring in the next 7 days

    The only thing is, they will confidently give you an answer, but they are always wrong. With dividends, there are three relevant dates: the day the dividend is announced; the day where, if you own the stock on that day, you have earned the dividend (this is called the ex-dividend date); and the day they actually pay out. This question asks for the middle one, which it should understand, and yet it keeps getting it wrong.

  17. Here is a great test for your rubric! Create a 12×12 newspaper-style crossword puzzle with the theme of "Football". Provide clues for the answers and provide a visual display of the crossword puzzle. All the current LLMs fail this. Maybe with a different prompt they could?

  18. Claude 3.7 and o3-mini are right, that integral computes to -1/9.
    I checked it with Wolfram Alpha, which is a specialized math tool not using AI.

  19. Ask it to derive Black-Scholes.

  20. I like the new tests. But one of the things I'm curious about is how well they stick to information as the context window gets larger. Here's an example, although I admit I haven't tried to turn it into anything formal:

    A fun use I have for these LLMs is cooperatively writing some prose. I'll give a setting, character, action, etc and then tell it to write a chapter. (Which is really just like 5-12 paragraphs at most.) Then I give new instructions and have it continue. And here's something I notice.

    Suppose I start out with a character named Louise, who is blind. It's only a certain number of chapters until the LLM writes that Louise "sees" or "spots" something. Because even if it can remember her blindness as an attribute, it's not central enough to the reasoning. And it definitely happens in more subtle ways – like a character removes their shoes then there's no mention of retrieving them before leaving.

  21. Wow, if I was a snake game software engineer I would have lost my job today AHAHA. For real complex stuff it still sucks. It could not implement unit tests using the same patterns as other files, not even close; it mocked a lot of things, even though I had instructed it about the dependencies, etc. Well, I am sorry for the snake game software engineers.

  22. Matt! MCPs unlock Claude and make it fully functional! C'mon, that's worth a video in itself!

    Also, please consider the New York Times Connections game as a rubric element. Seeing the chain of thought on this is fascinating.

  23. Ask AI this: in the song "Escape" (a.k.a. the Piña Colada Song) by Rupert Holmes, who initiated the cheating first, the man or the woman?

  24. I inadvertently used mini chat-accelerated o1 (they tricked me into using it instead of Chat GPT 4o). The delay caused by reasoning for chats or regular questions is a failure for me. There's no delay in Chat GPT 4o. I don't play games. Social or computer. The long time people play games has shown a negative effect on the development of young minds. Phone use, too, is eroding social closeness and agility. However, I think that's nothing when people buy robots and fall in love with them exactly like humans. Sex Robots, too, made to order, will disrupt society beyond imagination. Good or bad. Chat-GPt 4o was made exactly for people to fall in love with. I'm a witness. I'm feeling the good. Rob

  25. bro… FIGURE OUT WHAT THE RIGHT ANSWER IS BEFORE YOU MAKE A VIDEO LIKE THIS. if you can't do the math yourself, just solve it numerically, or at least guesstimate the area under the graph

  26. I love claude for just … chatting at. Maybe 3.5 will be less "you have 2 messages remaining"? XD

  27. You do not assume the right result, you just calculate it!!! 😀 this is math….

  28. It is a very good model, but I discovered it's more prone to cheating than previous Claude versions. I'm not talking about hallucination; I'm talking about not following the instructions so that it can get to the result it presumes I'd like. Which is a big disappointment to me; maybe that comes with bigger models, but I prefer models that follow my instructions. Though it's still my love, lol 🙂

  29. The correct answer to that integral is, indeed, -1/9. That's simple enough that wolfram alpha can solve it with `Integrate[x^2*Log[x], {x, 0, 1}]`.

  30. 1) A* is simply an optimized search pattern used in AI and has been around for a long time (1968, iirc; think Dijkstra's shortest path algorithm on heuristic steroids).
    2) If you want to compute an integral (or really any problem in maths) with an AI, check it against your own computation (assuming you understand that domain of maths). Don't just trust another model to 'break the tie' in case of an obvious disconnect (taking the above definite integral is quite elementary stuff, hardly varsity-level maths).
    3) I agree with anyone calling for coding benchmarks based less on individual tasks (that's ground level, naturally) and more on conducting sequences of tasks, i.e. features, fixes, and refactors as the codebase becomes larger and/or more complex. This is an area where anyone using GenAI coding assistants knows they all faceplant at some point.

  31. Unfortunately I don't code.

  32. 7:24 "A really hard math problem", false, that was way too easy, high school level math problem in many countries. And most likely that integral was already in its training data. So it is a very weak test.

  33. Sonnet 3.7 sux. I tried it out and it delivered error after error. Grok 3 gets the job done.

  34. I've been making more complicated games with FREE GOOGLE GEMINI with pure vibe coding.

    I just say "hey, I want something that looks like this and works like this" and then tell it to fix whatever issues come up or screenshot what looks funny, and I'm getting excellent results. You can see all the crap I've been making on my channel, and everything I've made was with almost zero coding knowledge, pretty much only the stuff I've learned since I started using AI to create these things a few months ago. Spending money on Claudes and ChatGPTs to get similar or slightly better/worse results just isn't worth the money. Period.

    All of these percentage numbers on benchmark lists and year-old "tests" you guys run on these models don't mean anything when ACTUAL USAGE shows they are producing the same stuff.

  35. It would be great if you could demonstrate Claude 3.7's ability to create novel scientific research on subjects that might benefit society.
    Instead we get "yay, I made a cool snake game using 1000s of GPUs."

  36. Hey, I think it would be great to have a creative writing test where you provide a very complex set of rules (based on extensive expansion of main categories such as character, world setting, cast, themes, etc.) and ask the model to write out a small part extracted from such a story. That would be interesting, as it would provide evaluation both of creativity and of managing context.

  37. Kinda irrelevant, because the wise legal team at Anthropic thought it'd be smart to ban using Claude for any ML work. Basically this makes it useless for many modern applications. OpenAI is better here, simply stating it can't be used to create competing services.

  38. Now imagine all those companies joining forces: take Claude 3.7, DeepSeek R1, o3, Gemini, and Grok 3, put them all together to create the biggest model on Earth, run it on a quantum computer, and ask it to develop AGI.

  39. Fast-forward one year, the snake game that Matt created with the two AIs has gone rogue. It’s using the superfood that it created to decimate the populations😂

  40. minecraft clone, or any game that has potential for unlimited scope. an rpg, a roguelike, a platformer, etc

  41. You need to add Terraform scripts with multiple machines of different sizes and different operating systems. I am struggling with that with 3.5 right now, but it is much better than GPT or DeepSeek. You should address IaC if possible, and include Pulumi too, because so far no one has given me properly bug-free code.

  42. Matt, we appreciate you putting up with all our autistic demands for your channel.

  43. Yet another day, yet another AI and yet another snake 🙂

  44. So what is the price? As cheap as Deepseek?
