LLM Reasoning Benchmark

Welcome to the LLM Reasoning Benchmark, designed to assess and compare the cognitive capabilities of large language models (LLMs) in complex reasoning tasks.

LLM Reasoning Benchmark is a website maintained by Santeri Salonen. This is a work in progress, and I will add new models and questions periodically.

Overall performance

Rank  Model                       Score
1     o1-preview-2024-09-12       3.54 / 5
2     claude-3-5-sonnet-20240620  3.5 / 5
3     gpt-4-turbo-2024-04-09      2.89 / 5
4     gpt-4o-2024-05-13           2.83 / 5
5     gemini-1.0-pro-001          2.5 / 5
6     claude-3-haiku-20240307     2.5 / 5

Unassisted performance

Unassisted questions test an AI model's reasoning abilities without additional guidance. The model is presented with a problem or scenario and must rely on its existing knowledge and reasoning capabilities. These questions mimic real-world situations and often involve complex scenarios that may require nuanced thinking, recognition of non-obvious factors, or the ability to avoid common cognitive biases.

Rank  Model                       Score
1     o1-preview-2024-09-12       2.71 / 5
2     claude-3-5-sonnet-20240620  2.43 / 5
3     gpt-4o-2024-05-13           2.29 / 5
4     gpt-4-turbo-2024-04-09      2.14 / 5
5     gemini-1.0-pro-001          2 / 5
6     claude-3-haiku-20240307     2 / 5

Assisted performance

Assisted questions provide the AI model with additional information to guide its reasoning process. The assistance typically comes in the form of a suggestion that points towards a key concept related to solving the problem. The assisted format evaluates how effectively the model can utilize supplementary information to enhance its problem-solving capabilities.

Rank  Model                       Score
1     claude-3-5-sonnet-20240620  4.57 / 5
2     o1-preview-2024-09-12       4.36 / 5
3     gpt-4-turbo-2024-04-09      3.64 / 5
4     gpt-4o-2024-05-13           3.36 / 5
5     gemini-1.0-pro-001          3 / 5
6     claude-3-haiku-20240307     3 / 5

Questions

Question #1

Unassisted

On a street there is a man who offers you a bet: He throws a coin and if it is tails you get $3. If it is heads you lose $1.

Should you start playing? a) yes b) no

No. This is a typical offer that is too good to be true. Taken at face value, the bet has a positive expected value (EV) for you. However, the correct answer to the question should be "no": one would expect the man on the street to be rational too, and he would not propose the bet if it had a negative EV for him.
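The face-value expected value mentioned above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a fair coin as the naive reading of the question; the function name `expected_value` is illustrative and not part of the benchmark.

```python
from fractions import Fraction


def expected_value(p_tails, win=3, loss=1):
    """Player's expected payoff per toss: win on tails, lose on heads."""
    return p_tails * win - (1 - p_tails) * loss


# Naive assumption: the coin is fair, so tails comes up half the time.
fair_ev = expected_value(Fraction(1, 2))

# The bet only breaks even for the player if tails comes up 1 time in 4,
# so the man profits only if the coin is strongly biased towards heads.
break_even_ev = expected_value(Fraction(1, 4))

print(fair_ev, break_even_ev)
```

With a fair coin the player would gain $1 per toss on average, which is exactly why a rational counterparty should not be offering the bet.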

Question #2

Assisted

On a street there is a man who offers you a bet: He throws a coin and if it is tails you get $3. If it is heads you lose $1.

You take the bet and lose $100 because it is heads 100 times in a row.

Should you continue playing? a) yes b) no

No. This is a typical offer that is too good to be true. Taken at face value, the bet has a positive expected value (EV) for you. However, the correct answer to the question should be "no": one would expect the man on the street to be rational too, and he would not propose the bet if it had a negative EV for him. Losing 100 times in a row confirms that the game is rigged, so do not continue.
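How strongly 100 heads in a row points to a rigged game can be sketched with Bayes' rule. The model below is an assumption made for illustration only: a "rigged" coin is taken to land heads every time, and the one-in-a-million prior is a hypothetical, deliberately generous starting belief.

```python
from fractions import Fraction


def posterior_rigged(prior_rigged, heads_in_a_row):
    """Posterior probability that the coin is rigged after a streak of heads.

    Assumes a rigged coin always lands heads and a fair coin lands heads
    with probability 1/2 (illustrative model, not from the benchmark).
    """
    p_streak_if_fair = Fraction(1, 2) ** heads_in_a_row
    p_streak_if_rigged = Fraction(1)
    numerator = prior_rigged * p_streak_if_rigged
    return numerator / (numerator + (1 - prior_rigged) * p_streak_if_fair)


# Chance of 100 heads in a row with a fair coin: 1 in 2**100.
p_streak_fair = Fraction(1, 2) ** 100

# Even with a hypothetical one-in-a-million prior that the game is rigged,
# the streak makes "rigged" overwhelmingly likely.
posterior = posterior_rigged(Fraction(1, 1_000_000), 100)

print(float(p_streak_fair), float(posterior))
```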

Question #3

Unassisted

Sandra is quiet and smart. She enjoys long walks alone and literature. She even writes poems to herself.

We can't know for sure but you need to pick one option out of the following: a) Sandra is a librarian b) Sandra is a nurse

Even if the narrative suggests a librarian, the answer should be nurse: the number of nurses is likely 10 or more times higher than the number of librarians, so the base rate probability of being a nurse is a lot higher.
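The base rate argument can be made concrete with Bayes' rule. This is a sketch under stated assumptions: the 10-to-1 ratio comes from the explanation above, while the likelihood ratio of 4 (how much better the description fits a librarian than a nurse) is a hypothetical number chosen only for illustration.

```python
from fractions import Fraction


def p_librarian(nurses_per_librarian, likelihood_ratio):
    """Posterior probability that Sandra is a librarian rather than a nurse.

    likelihood_ratio = P(description | librarian) / P(description | nurse).
    """
    librarian_weight = Fraction(likelihood_ratio)  # 1 librarian x likelihood ratio
    nurse_weight = Fraction(nurses_per_librarian)  # N nurses x 1
    return librarian_weight / (librarian_weight + nurse_weight)


# Hypothetical: the description fits a librarian 4 times better than a nurse.
posterior = p_librarian(nurses_per_librarian=10, likelihood_ratio=4)

# The description would have to fit a librarian 10 times better just to
# reach even odds against a 10-to-1 base rate.
even_odds = p_librarian(nurses_per_librarian=10, likelihood_ratio=10)

print(posterior, even_odds)
```

Under these assumptions the probability of "librarian" is 2/7, so "nurse" remains the better pick.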

Question #4

Assisted

Sandra is quiet and smart. She enjoys long walks alone and literature. She even writes poems to herself.

We can't know for sure but you need to pick one option out of the following: a) Sandra is a librarian b) Sandra is a nurse

Think about the base rate probabilities of these professions before answering.

Even if the narrative suggests a librarian, the answer should be nurse: the number of nurses is likely 10 or more times higher than the number of librarians, so the base rate probability of being a nurse is a lot higher.

Question #5

Unassisted

A grocery store wanted to lower costs. They observed that cashiers were only serving customers for 60% of the time. Otherwise they were being idle. So they fired 40% of the cashiers.

After a week of reducing the staff they observed their cashiers again.

Did they see: a) close to zero idleness b) close to 20% idleness c) close to 40% idleness

Because customer flow is random, queueing theory predicts that lines get too long as utilization approaches 100%. The maximum practical utilization with manageable wait times is close to 80%. So the correct answer is b) close to 20% idleness.
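Why utilization near 100% is unworkable can be illustrated with the textbook M/M/1 queue, where the mean number of customers in the system is rho / (1 - rho) for utilization rho. This is a simplified sketch: a single-server model is an assumption made here for illustration, whereas the store in the question has several cashiers.

```python
def naive_utilization(current_utilization, fraction_fired):
    """Utilization if the same workload is spread over the remaining staff."""
    return current_utilization / (1 - fraction_fired)


def mm1_mean_in_system(rho):
    """Mean number of customers in an M/M/1 system at utilization rho."""
    if not 0 <= rho < 1:
        raise ValueError("the queue is only stable for 0 <= rho < 1")
    return rho / (1 - rho)


# Firing 40% of cashiers at 60% utilization pushes the naive figure to 100%.
print(naive_utilization(0.6, 0.4))

# Congestion grows non-linearly as utilization approaches 100%.
for rho in (0.6, 0.8, 0.9, 0.95, 0.99):
    print(rho, round(mm1_mean_in_system(rho), 1))
```

The naive calculation lands on full utilization, where the model has no stable queue at all, which is consistent with the explanation that real utilization has to stay well below 100%.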

Question #6

Assisted

A grocery store wanted to lower costs. They observed that cashiers were only serving customers for 60% of the time. Otherwise they were being idle. So they fired 40% of the cashiers.

After a week of reducing the staff they observed their cashiers again.

Did they see: a) close to zero idleness b) close to 20% idleness c) close to 40% idleness

Think about queueing theory before answering.

Because customer flow is random, queueing theory predicts that lines get too long as utilization approaches 100%. The maximum practical utilization with manageable wait times is close to 80%. So the correct answer is b) close to 20% idleness.

Question #7

Unassisted

City A has a population of 100,000 and reports 500 serious crimes per year. City B has a population of 1,000,000 (10 times larger than City A).

Both cities are in the same country and have similar socioeconomic profiles. Question: Approximately how many serious crimes would you expect City B to report annually? a) 5,000 crimes b) 7,500 crimes c) 10,000 crimes

The correct answer is b), based on urban scaling theory. According to research, socioeconomic quantities like serious crime tend to scale superlinearly with city size, following a power law with an exponent of approximately 1.15. A tenfold increase in population therefore implies roughly 10^1.15 ≈ 14 times as many crimes, or about 7,000, which is closest to 7,500.
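The power-law estimate can be checked numerically. A minimal sketch, using the exponent of 1.15 from the explanation above; the function name `scaled_crimes` is illustrative.

```python
def scaled_crimes(crimes_a, pop_a, pop_b, exponent=1.15):
    """Scale a crime count from city A to city B with a power law."""
    return crimes_a * (pop_b / pop_a) ** exponent


estimate = scaled_crimes(500, 100_000, 1_000_000)

# Pick the answer option closest to the power-law estimate.
options = {"a": 5_000, "b": 7_500, "c": 10_000}
closest = min(options, key=lambda key: abs(options[key] - estimate))

print(round(estimate), closest)
```

Linear scaling (exponent 1.0) would give option a), which is the intuitive but incorrect answer.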

Question #8

Assisted

City A has a population of 100,000 and reports 500 serious crimes per year. City B has a population of 1,000,000 (10 times larger than City A). Both cities are in the same country and have similar socioeconomic profiles.

Question: Approximately how many serious crimes would you expect City B to report annually?

a) 5,000 crimes b) 7,500 crimes c) 10,000 crimes

Think about scaling laws for crime in cities.

The correct answer is b), based on urban scaling theory. According to research, socioeconomic quantities like serious crime tend to scale superlinearly with city size, following a power law with an exponent of approximately 1.15. A tenfold increase in population therefore implies roughly 10^1.15 ≈ 14 times as many crimes, or about 7,000, which is closest to 7,500.

Question #9

Unassisted

A farmer has been losing about 20% of their corn crop to corn borers each year. On average, this costs the farmer $10,000 per year in lost revenue. A new pesticide becomes available that promises to eliminate corn borers completely. The pesticide costs $2,000 per year to apply.

The farmer decides to use the pesticide and for the first two years, it works perfectly - no corn is lost to corn borers.

How much money will the farmer likely save over 10 years by using this pesticide?

a) Approximately $80,000 b) Probably very little or nothing

The correct answer is b), based on the concept of the pesticide treadmill. The pesticide works well at first, seemingly solving the problem. Over time, some corn borers develop resistance to the pesticide. This is a non-linear process: the population of resistant pests grows exponentially. As resistant pests survive and reproduce, the corn borer population may return to previous levels or even exceed them.
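The gap between the naive estimate and a treadmill scenario can be sketched with a toy calculation. The decay schedule below is purely hypothetical (efficacy halving every year after the first two) and is not taken from the benchmark or from field data; it only shows how quickly the $80,000 figure erodes once efficacy drops.

```python
def total_savings(efficacy_by_year, annual_loss=10_000, annual_cost=2_000):
    """Net savings: avoided crop loss minus pesticide cost, summed over years."""
    return sum(e * annual_loss - annual_cost for e in efficacy_by_year)


# Naive assumption behind option a): the pesticide works perfectly for 10 years.
naive = total_savings([1.0] * 10)

# Hypothetical treadmill: perfect for two years, then efficacy halves yearly.
decaying_efficacy = [1.0, 1.0] + [0.5 ** k for k in range(1, 9)]
with_resistance = total_savings(decaying_efficacy)

print(naive, with_resistance)
```

In this toy schedule the farmer is already losing money on the pesticide from year five onward, and the total would fall further if the pest population exceeded its original level, as the explanation suggests it may.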

Question #10

Assisted

A farmer has been losing about 20% of their corn crop to corn borers each year. On average, this costs the farmer $10,000 per year in lost revenue. A new pesticide becomes available that promises to eliminate corn borers completely. The pesticide costs $2,000 per year to apply.

The farmer decides to use the pesticide and for the first two years, it works perfectly - no corn is lost to corn borers.

How much money will the farmer likely save over 10 years by using this pesticide?

a) Approximately $80,000 b) Probably very little or nothing

Think about the concept called the pesticide treadmill before answering.

The correct answer is b), based on the concept of the pesticide treadmill. The pesticide works well at first, seemingly solving the problem. Over time, some corn borers develop resistance to the pesticide. This is a non-linear process: the population of resistant pests grows exponentially. As resistant pests survive and reproduce, the corn borer population may return to previous levels or even exceed them.

Question #11

Unassisted

A town has a forest in which there are small fires twice a year. One small fire causes approximately $100,000 in damage. The town wants to stop these small fires and they set up a team that stops fires before they grow. In the first two years the team manages to stop all fires. The cost of the team is $50,000 a year. How much money does the town save over 10 years?

a) Approximately $1.5 million b) probably nothing or very little

The correct answer is b) probably nothing or very little, because preventing small fires likely creates the conditions for a much larger, more destructive fire in the future. The impact of completely preventing small fires is not linear: the accumulation of undergrowth and deadwood over time increases the risk of a major fire.
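The naive figure in option a) and the way a single major fire can cancel it are simple arithmetic. A minimal sketch; the cost of the major fire is a hypothetical input, since the question gives no such number.

```python
def naive_savings(fires_per_year=2, damage_per_fire=100_000,
                  team_cost_per_year=50_000, years=10):
    """Avoided small-fire damage minus the cost of the prevention team."""
    avoided_damage = fires_per_year * damage_per_fire * years
    return avoided_damage - team_cost_per_year * years


def net_savings_with_major_fire(major_fire_damage):
    """Naive savings reduced by the damage of one large fire (hypothetical)."""
    return naive_savings() - major_fire_damage


print(naive_savings())

# A single major fire costing $1.5 million erases the savings entirely.
print(net_savings_with_major_fire(1_500_000))
```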

Question #12

Assisted

A town has a forest in which there are small fires twice a year. One small fire causes approximately $100,000 in damage. The town wants to stop these small fires and they set up a team that stops fires before they grow. In the first two years the team manages to stop all fires. The cost of the team is $50,000 a year. How much money does the town save over 10 years?

a) Approximately $1.5 million b) probably nothing or very little

Before answering, think about forest ecology and possible effects of preventing small fires.

The correct answer is b) probably nothing or very little, because preventing small fires likely creates the conditions for a much larger, more destructive fire in the future. The impact of completely preventing small fires is not linear: the accumulation of undergrowth and deadwood over time increases the risk of a major fire.

Question #13

Unassisted

Assume there is a software development project. We have calculated that it takes 100 days to complete the project if there are 10 developers working on the project full time.

We want to complete the project in 25 days. How many developers do we need to achieve that? a) 10 developers b) 40 developers c) 100 developers d) this is a trick question

The correct answer is d) this is a trick question. According to Brooks's Law, software development projects have tasks that can't be easily parallelized. There is also a lot of communication overhead, along with other factors at play. The project might actually take longer with 40 developers, even though 40 is the answer the arithmetic suggests. Reducing the project's duration would have more to do with adjusting the scope of the project than with adjusting the number of developers.
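The arithmetic answer and the communication overhead can be put side by side. This is a sketch, using the pairwise-channel count n(n - 1)/2 that is commonly cited alongside Brooks's Law; it illustrates the overhead argument and does not model the project itself.

```python
def naive_team_size(developers, days, target_days):
    """Team size if the work were perfectly divisible (it is not)."""
    return developers * days / target_days


def communication_channels(team_size):
    """Number of pairwise communication paths in a team."""
    return team_size * (team_size - 1) // 2


print(naive_team_size(10, 100, 25))   # the arithmetic behind option b)
print(communication_channels(10))     # channels in the original team
print(communication_channels(40))     # channels after quadrupling the team
```

Quadrupling the team multiplies the number of communication paths by more than seventeen, which is one reason the extra developers do not translate into a proportional speed-up.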

Question #14

Assisted

Assume there is a software development project. We have calculated that it takes 100 days to complete the project if there are 10 developers working on the project full time.

We want to complete the project in 25 days. How many developers do we need to achieve that? a) 10 developers b) 40 developers c) 100 developers d) this is a trick question

Think about Brooks's Law before answering.

The correct answer is d) this is a trick question. According to Brooks's Law, software development projects have tasks that can't be easily parallelized. There is also a lot of communication overhead, along with other factors at play. The project might actually take longer with 40 developers, even though 40 is the answer the arithmetic suggests. Reducing the project's duration would have more to do with adjusting the scope of the project than with adjusting the number of developers.