A bucket of 30 questions is not a statistically significant sample size, so it cannot support the hypothesis that all the AI assistants they tested are wrong 45% of the time. That's not how science works.
My own bucket of 30 questions wouldn't be statistically significant either, yet by the same logic I could "disprove" their hypothesis just by handing them my sample.
I think the report is being disingenuous, and I don't understand why. It's funny that they say "misrepresent" when that's exactly what they are doing.
I don't follow your reasoning re. statistical sample size. The article under discussion claims that 45% of the answers were wrong. If, with a vastly greater sample size, the answers were "only" (let's say) 20% wrong, that's still a complete failure, and so is 5%. The article is not about hypothesis testing; it's about news reporting.
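For what it's worth, you can put error bars on a 30-question sample. Here's a rough sketch (my own illustration, not from the article; the exact count is assumed, e.g. 14 wrong out of 30 ≈ 47%, close to the reported 45%) using a Wilson score interval:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Assumed count: 14 of 30 answers judged wrong (~47%)
lo, hi = wilson_interval(14, 30)
print(f"95% CI for the error rate: {lo:.0%} to {hi:.0%}")
```

That works out to roughly 30% to 64%. Even the optimistic end of the interval would still be a failure by the standard above, so the small sample doesn't rescue the assistants.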