본문 바로가기
- 배움이 있는 삶/- AI | Big data

GPT-4 evaluation : GPT-4 Technical Report

by story of interesting 2024. 3. 21.
반응형

https://arxiv.org/pdf/2303.08774.pdf

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.4 We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C. Exams were sourced from publicly-available materials. Exam questions included both multiplechoice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions which required it. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. We estimate and report the percentile each overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.

반응형