GPT-4 evaluation: GPT-4 Technical Report

https://arxiv.org/pdf/2303.08774.pdf

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C. Exams were sourced from publicly-available materials. Exam questions included both multiple-choice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions which required it. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. We estimate and report the percentile each overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.
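The contamination rule in the passage above (score the exam twice, once in full and once with the questions seen during training removed, then report the lower of the two) can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the report's actual pipeline: the `Question` class, the `accuracy` and `reported_score` helpers, and the use of plain accuracy as the score are all hypothetical, whereas the real exams are scored with each exam's own published methodology.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    """Hypothetical record for one graded exam question."""
    correct: bool        # did the model answer correctly?
    contaminated: bool   # was the question flagged as seen during training?


def accuracy(questions: List[Question]) -> float:
    """Fraction of questions answered correctly (assumes a non-empty list)."""
    return sum(q.correct for q in questions) / len(questions)


def reported_score(questions: List[Question]) -> float:
    """Score the full exam and the decontaminated variant; return the lower."""
    full = accuracy(questions)
    clean_subset = [q for q in questions if not q.contaminated]
    # Assumption: if every question is flagged, fall back to the full score,
    # since there is no decontaminated variant left to run.
    if not clean_subset:
        return full
    return min(full, accuracy(clean_subset))


if __name__ == "__main__":
    exam = [
        Question(correct=True, contaminated=True),
        Question(correct=True, contaminated=False),
        Question(correct=False, contaminated=False),
        Question(correct=True, contaminated=False),
    ]
    print(f"reported score: {reported_score(exam):.3f}")
```

Taking the minimum makes the reported number conservative: if the model benefited from having seen some questions, the decontaminated score is lower and is the one reported; if removing those questions happens to raise the score, the full-exam score is reported instead.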
