Fine-tuning Qwen3-VL to minimize the prompt and improve consistency: 119 -> 37 tokens

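When YOLO failed to find a button, I fell back to an LLM (GPT) API to locate it.
Recently, however, these failures have become more frequent. The cause is the appearance of fake X buttons like the one shown on the left. I have not solved that problem yet, but the number of LLM queries has grown enough that moving to a local LLM became necessary.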
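
The fallback order described above can be sketched as a small cascade. This is only an illustration, not the actual pipeline: `find_button`, the `Detector` type, and the two stub detectors are hypothetical names, and each detector is assumed to return an `(x, y)` point or `None` when it finds nothing.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
# Hypothetical detector signature: takes a screenshot path and returns a
# click point, or None when nothing was found.
Detector = Callable[[str], Optional[Point]]


def find_button(image_path: str, detectors: Sequence[Tuple[str, Detector]]):
    """Try each detector in order and return (name, point) for the first hit."""
    for name, detect in detectors:
        point = detect(image_path)
        if point is not None:
            return name, point
    return None, None


# Stub detectors standing in for YOLO and a local VLM (assumptions for the demo).
def yolo_stub(image_path: str) -> Optional[Point]:
    return None  # simulate a YOLO miss, e.g. on a fake X button


def vlm_stub(image_path: str) -> Optional[Point]:
    return (80, 58)


source, point = find_button("screenshot.png", [("yolo", yolo_stub), ("vlm", vlm_stub)])
print(source, point)
```

Keeping the detectors in an ordered list makes it easy to swap the remote GPT call for a local model without touching the calling code.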

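1. Previous LLM-based approach and its limits
- At first I tried to infer the button position from the screen-layout XML, but that turned out to be impossible: every element had its description, and any hint about the component's purpose, stripped out.
2. Switching to a VLM – Base model test
Qwen3-VL is a well-known model in the VLM space, so I chose it as the model to evaluate.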

Prompt – A
<|im_start|>system
You are a helpful mobile UI expert that analyzes app screenshots.
Your task is to locate the close or skip button in advertisements.
<|im_end|>
<|im_start|>user
Look at the image and return ONLY a JSON object in this exact schema:
{
  "x": <integer pixel x>,
  "y": <integer pixel y>,
  "confidence": <0.0~1.0>,
}
Rules:
- Coordinates are absolute pixels (origin top-left)
- Respond ONLY with valid JSON (no explanations)

Where should I click to close the advertisement?
<|im_end|>

Prompt – B
<|im_start|>system
You are a helpful mobile UI expert.
<|im_end|>
<|im_start|>user
Find the close or skip button in the advertisement and return JSON coordinates.
<|im_end|>

I used the two prompts above; their token counts are 119 and 37, respectively.

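Base model test results

Prompt – A
{
  "x": 848,
  "y": 620,
  "confidence": 0.97
}
Prompt – B
```json
[
  {"bbox_2d": [775, 825, 999, 900], "label": "close or skip button in the advertisement"}
]
```

The long prompt and the short prompt produce different outputs: Prompt A returns the requested x/y/confidence object, while Prompt B returns a Markdown-fenced list containing a bbox_2d box and a label.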
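
Because the base model answers in two different shapes, any caller has to normalize them before clicking. The sketch below is one way to do that, not the code used in this post: `extract_click_point` is a hypothetical helper, and it assumes `bbox_2d` is laid out as `[x1, y1, x2, y2]`. The post does not say whether those values are pixels or a normalized scale, so the result stays in the model's own units.

```python
import json
import re
from typing import Optional, Tuple

FENCE = "`" * 3  # Markdown code fence marker
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$")


def extract_click_point(raw: str) -> Optional[Tuple[float, float]]:
    """Return an (x, y) point from either output style, or None if unparsable."""
    # Drop an optional Markdown fence around the JSON payload.
    text = FENCE_RE.sub("", raw.strip())
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # A list answer: use its first element.
    if isinstance(data, list):
        if not data:
            return None
        data = data[0]
    if not isinstance(data, dict):
        return None
    if "bbox_2d" in data:
        # Assumed layout: [x1, y1, x2, y2]; the click point is the box center.
        x1, y1, x2, y2 = data["bbox_2d"]
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    if "x" in data and "y" in data:
        return (float(data["x"]), float(data["y"]))
    return None


print(extract_click_point('{"x": 848, "y": 620, "confidence": 0.97}'))
```

A tolerant parser like this hides the inconsistency but does not remove it, which is the motivation for fine-tuning the model to emit a single format.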

3. Dataset construction
– I reused the failure cases I had collected while using the ChatGPT API.
– Each sample consists of an input image, x,y coordinates, and a reason.
– For details, see the earlier blog post – https://flywithu.com/archives/8063
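
As a rough illustration of how such records could be turned into training samples, the sketch below builds one chat-style example per failure case. The record keys (`image`, `x`, `y`, `reason`), the `to_chat_sample` helper, and the message layout are assumptions, not the actual training script. Fixing `confidence` at 1.0 and dropping `reason` from the target are also assumptions, chosen so the label matches the fine-tuned output shown in the results.

```python
import json

# Short Prompt B, copied from the debug output in the results section.
SYSTEM_PROMPT = "You are a helpful mobile UI expert."
USER_PROMPT = "Find the close or skip button in the advertisement and return JSON coordinates."


def to_chat_sample(record: dict) -> dict:
    """Convert one collected failure case into a chat-style training sample."""
    # Assumed target format: ground-truth labels get confidence 1.0.
    target = {"x": int(record["x"]), "y": int(record["y"]), "confidence": 1.0}
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": record["image"]},
                    {"type": "text", "text": USER_PROMPT},
                ],
            },
            # The label is the exact JSON string the model should learn to emit.
            {"role": "assistant", "content": json.dumps(target)},
        ]
    }


sample = to_chat_sample({"image": "ad_001.png", "x": 80, "y": 58, "reason": "X at top-left"})
print(sample["messages"][-1]["content"])
```

Training on a bare JSON string with no Markdown fence is what lets the short prompt work: the output format lives in the weights instead of in the prompt.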

4. Fine-tuning
– Colab sessions are capped at 4~5 hours, which did not suit long training runs.
– As the next-best option I used https://lightning.ai/. Its monthly GPU allowance is smaller than Colab's, but it allows continuous use.

5. Experiment results (measured on CPU; with a GPU, inference takes 2~3 s.)

Item	Fine – Prompt B	Base – Prompt B	Base – Prompt A
JSON completeness	Pass (100%)	Fail	Pass (Markdown JSON)
in_tok / out_tok	37/23	37/51	119/33
Time	66s	80s	73s
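
The "JSON completeness" row could be reproduced with a small checker along these lines. This is a sketch under assumptions: `json_completeness` is a hypothetical helper, and "complete" is taken to mean a single JSON object with `x`, `y`, and `confidence` keys, optionally wrapped in a Markdown fence.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
REQUIRED_KEYS = {"x", "y", "confidence"}


def json_completeness(raw: str) -> str:
    """Classify a raw answer as 'pass', 'pass (markdown)', or 'fail'."""
    text = raw.strip()
    fenced = text.startswith(FENCE) and text.endswith(FENCE)
    if fenced:
        # Remove the closing fence and the whole opening fence line.
        text = text[len(FENCE):-len(FENCE)]
        text = text.split("\n", 1)[1] if "\n" in text else ""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "fail"
    # Only a single object carrying all required keys counts as complete.
    if not (isinstance(data, dict) and REQUIRED_KEYS.issubset(data)):
        return "fail"
    return "pass (markdown)" if fenced else "pass"


print(json_completeness('{"x": 80, "y": 58, "confidence": 1.0}'))
```

Applied to the three raw outputs logged below, this check gives the same three verdicts as the table.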

--- Fine-Tuned Model + Short Prompt B
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
[debug] Raw model output:
system
You are a helpful mobile UI expert.
user

Find the close or skip button in the advertisement and return JSON coordinates.
assistant
{"x": 80, "y": 58, "confidence": 1.0}

x=80, y=58, time=66.193s , in_tok=37, out_tok=23

--- Baseline Model + Short Prompt B
[debug] Raw model output:
system
You are a helpful mobile UI expert.
user

Find the close or skip button in the advertisement and return JSON coordinates.
assistant
```json
[
  {
    "x": 50,
    "y": 35,
    "width": 100,
    "height": 50,
    "label": "skip"
  }
]
```

x=50, y=35, time=80.670s , in_tok=37, out_tok=51

--- Baseline Model + Long Prompt A
[debug] Raw model output:
system
You are a helpful mobile UI expert that analyzes app screenshots.
Your task is to locate the close or skip button in advertisements.
user
Look at the image and return ONLY a JSON object in this exact schema:
{
  "x": <integer pixel x>,
  "y": <integer pixel y>,
  "confidence": <0.0~1.0>,
}
Rules:
- Coordinates are absolute pixels (origin top-left)
- Respond ONLY with valid JSON (no explanations)


Where should I click to close the advertisement?
assistant
```json
{
  "x": 50,
  "y": 40,
  "confidence": 0.98
}
```

x=50, y=40, time=73.148s , in_tok=119, out_tok=33
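
To compare runs, the summary lines in the logs above can be parsed and turned into relative savings. The helper names `parse_summary` and `reduction_pct` are hypothetical, and the regular expression assumes the exact line format shown in these logs.

```python
import re

SUMMARY_RE = re.compile(
    r"x=(?P<x>\d+), y=(?P<y>\d+), time=(?P<time>[\d.]+)s\s*, "
    r"in_tok=(?P<in_tok>\d+), out_tok=(?P<out_tok>\d+)"
)


def parse_summary(line: str) -> dict:
    """Parse one 'x=..., y=..., time=...s , in_tok=..., out_tok=...' line."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    fields = m.groupdict()
    return {
        "x": int(fields["x"]),
        "y": int(fields["y"]),
        "time_s": float(fields["time"]),
        "in_tok": int(fields["in_tok"]),
        "out_tok": int(fields["out_tok"]),
    }


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


fine_b = parse_summary("x=80, y=58, time=66.193s , in_tok=37, out_tok=23")
base_a = parse_summary("x=50, y=40, time=73.148s , in_tok=119, out_tok=33")

for key in ("in_tok", "out_tok", "time_s"):
    print(f"{key}: {reduction_pct(base_a[key], fine_b[key]):.1f}% lower")
```

For these two runs that works out to roughly 68.9% fewer input tokens, 30.3% fewer output tokens, and 9.5% less wall-clock time on CPU. Each figure comes from a single logged run, so the timing difference in particular should be read as indicative only.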

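6. Conclusion
Fine-tuning internalizes the task in the model -> stable output
Removing the overly long prompt -> fewer tokens (input 119 -> 37), lower decode cost (output 33 -> 23 tokens), faster inference (73 s -> 66 s on CPU)
For long training runs, Lightning.ai is worth considering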
