{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "41a28d97",
   "metadata": {},
   "source": [
    "# C3: Collecting YouTube comments\n",
    "In this notebook we collect a small sample of public comments from YouTube.\n",
    "The goal is not to build the large final corpus, but to understand the workflow:\n",
    "source → API → raw comments → JSONL file.\n",
    "At the end, each student saves their own file in `data/raw/`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "211f081d",
   "metadata": {},
   "source": [
    "## 1. What we need to have ready\n",
    "We need:\n",
    "- a `.env` file in the project root\n",
    "- the `YOUTUBE_API_KEY` key\n",
    "- a YouTube channel handle\n",
    "\n",
    "Example `.env` entry:\n",
    "```text\n",
    "YOUTUBE_API_KEY=your_key_here\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3ab0591b",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from pathlib import Path\n",
    "import os\n",
    "import json\n",
    "import requests\n",
    "from datetime import datetime\n",
    "from dotenv import load_dotenv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c40292d4",
   "metadata": {},
   "source": [
    "## 2. Load the API key\n",
    "The notebook looks for the `.env` file in the project root, walking up from the current working directory.\n",
    "If the key is not found, the collection cannot start."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bc605e11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Project root: c:\\PROJECTS\\echochamber-app\n",
      "Key found: True\n"
     ]
    }
   ],
   "source": [
    "# Walk up from the current directory until a folder containing .env is found\n",
    "ROOT = Path.cwd()\n",
    "while not (ROOT / \".env\").exists() and ROOT.parent != ROOT:\n",
    "    ROOT = ROOT.parent\n",
    "load_dotenv(ROOT / \".env\")\n",
    "API_KEY = os.getenv(\"YOUTUBE_API_KEY\")\n",
    "BASE_URL = \"https://www.googleapis.com/youtube/v3\"\n",
    "print(\"Project root:\", ROOT)\n",
    "print(\"Key found:\", API_KEY is not None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a379235",
   "metadata": {},
   "source": [
    "## 3. Choose the channel and the number of videos\n",
    "Each student changes `student_id` and `handle`.\n",
    "For this exercise we use only a few videos, so that we do not waste API quota."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "10b64058",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "c:\\PROJECTS\\echochamber-app\\data\\raw\\student_01_youtube_raw.jsonl\n"
     ]
    }
   ],
   "source": [
    "student_id = \"student_01\"\n",
    "handle = \"digi24hd56\"\n",
    "max_videos = 2\n",
    "max_comments_per_video = 100\n",
    "output_file = ROOT / \"data\" / \"raw\" / f\"{student_id}_youtube_raw.jsonl\"\n",
    "print(output_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c48cb20",
   "metadata": {},
   "source": [
    "## 4. Find the YouTube channel\n",
    "\n",
    "Internally, YouTube works with a `channel_id`, not directly with the channel name.\n",
    "That is why the first step is to resolve the handle to a `channel_id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6dfd4d64",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'kind': 'youtube#channelListResponse',\n",
       " 'etag': 'bPGHk7xg5axBT4a3koCcnNny28s',\n",
       " 'pageInfo': {'totalResults': 1, 'resultsPerPage': 5},\n",
       " 'items': [{'kind': 'youtube#channel',\n",
       "   'etag': 'BhuTK97GknHy20Igr-aKGXPuJdU',\n",
       "   'id': 'UCbvKamSrJkwT6ed2BMMZXwg'}]}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
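   ],
   "source": [
    "channel_response = requests.get(\n",
    "    f\"{BASE_URL}/channels\",\n",
    "    params={\n",
    "        \"part\": \"id\",\n",
    "        \"forHandle\": handle,\n",
    "        \"key\": API_KEY\n",
    "    },\n",
    "    timeout=30,  # do not hang indefinitely if the API is unreachable\n",
    ")\n",
    "channel_response.raise_for_status()  # stop early on HTTP errors such as an invalid key\n",
    "channel_data = channel_response.json()\n",
    "channel_data"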
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "36112175",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'UCbvKamSrJkwT6ed2BMMZXwg'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "channel_id = channel_data[\"items\"][0][\"id\"]\n",
    "channel_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "727248a5",
   "metadata": {},
   "source": [
    "## 5. Get the most recent videos\n",
    "Now we request the latest videos published by the channel.\n",
    "For the course we use only a few videos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "fed96dc9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'kind': 'youtube#searchResult',\n",
       " 'etag': 'pW-oMrmuV_wk35JIH55B1oWoQJs',\n",
       " 'id': {'kind': 'youtube#video', 'videoId': 'Fyk8Ob7CRjw'},\n",
       " 'snippet': {'publishedAt': '2026-05-04T09:51:48Z',\n",
       "  'channelId': 'UCbvKamSrJkwT6ed2BMMZXwg',\n",
       "  'title': '🟣 Știrile Digi24 de la ora 12 – 4 mai 2026',\n",
       "  'description': 'Știrile Digi24 de la ora 12 – 4 mai 2026 ➥ Pentru mai multe știri vizitează site-ul Digi24 - https://www.digi24.ro/ ➥ Abonează-te la ...',\n",
       "  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/Fyk8Ob7CRjw/default.jpg',\n",
       "    'width': 120,\n",
       "    'height': 90},\n",
       "   'medium': {'url': 'https://i.ytimg.com/vi/Fyk8Ob7CRjw/mqdefault.jpg',\n",
       "    'width': 320,\n",
       "    'height': 180},\n",
       "   'high': {'url': 'https://i.ytimg.com/vi/Fyk8Ob7CRjw/hqdefault.jpg',\n",
       "    'width': 480,\n",
       "    'height': 360}},\n",
       "  'channelTitle': 'Digi24HD',\n",
       "  'liveBroadcastContent': 'none',\n",
       "  'publishTime': '2026-05-04T09:51:48Z'}}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "videos_response = requests.get(\n",
    "    f\"{BASE_URL}/search\",\n",
    "    params={\n",
    "        \"part\": \"snippet\",\n",
    "        \"channelId\": channel_id,\n",
    "        \"type\": \"video\",\n",
    "        \"order\": \"date\",\n",
    "        \"maxResults\": max_videos,\n",
    "        \"key\": API_KEY\n",
    "    },\n",
    "    timeout=30,  # do not hang indefinitely if the API is unreachable\n",
    ")\n",
    "videos_response.raise_for_status()  # stop early on HTTP errors such as an invalid key\n",
    "videos_data = videos_response.json()\n",
    "videos_data[\"items\"][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "80561684",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'video_id': 'Fyk8Ob7CRjw',\n",
       "  'video_title': '🟣 Știrile Digi24 de la ora 12 – 4 mai 2026',\n",
       "  'video_date': '2026-05-04'},\n",
       " {'video_id': 'VV9sV-eBVeA',\n",
       "  'video_title': '#PetStory: Povestea cățelușei Eli #digi24',\n",
       "  'video_date': '2026-05-04'}]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "videos = []\n",
    "for item in videos_data[\"items\"]:\n",
    "    videos.append({\n",
    "        \"video_id\": item[\"id\"][\"videoId\"],\n",
    "        \"video_title\": item[\"snippet\"][\"title\"],\n",
    "        \"video_date\": item[\"snippet\"][\"publishedAt\"][:10]\n",
    "    })\n",
    "videos"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47aba7af",
   "metadata": {},
   "source": [
    "## 6. Collect the comments\n",
    "For each video we fetch the public top-level comments, ordered by relevance.\n",
    "In this exercise we do not use pagination, so we get at most 100 comments per video."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "799d9da0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting: 🟣 Știrile Digi24 de la ora 12 – 4 mai 2026\n",
      "Collecting: #PetStory: Povestea cățelușei Eli #digi24\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datetime import datetime, timezone\n",
    "\n",
    "# One timezone-aware UTC date for the whole run (datetime.utcnow() is deprecated)\n",
    "collected_at = datetime.now(timezone.utc).strftime(\"%Y-%m-%d\")\n",
    "\n",
    "comments = []\n",
    "for video in videos:\n",
    "    print(\"Collecting:\", video[\"video_title\"][:80])\n",
    "    comments_response = requests.get(\n",
    "        f\"{BASE_URL}/commentThreads\",\n",
    "        params={\n",
    "            \"part\": \"snippet\",\n",
    "            \"videoId\": video[\"video_id\"],\n",
    "            \"maxResults\": max_comments_per_video,\n",
    "            \"textFormat\": \"plainText\",\n",
    "            \"order\": \"relevance\",\n",
    "            \"key\": API_KEY\n",
    "        },\n",
    "        timeout=30,\n",
    "    )\n",
    "    comments_data = comments_response.json()\n",
    "    # Error responses (for example, when comments are disabled) have no \"items\", so .get() skips them\n",
    "    for comment_item in comments_data.get(\"items\", []):\n",
    "        snippet = comment_item[\"snippet\"][\"topLevelComment\"][\"snippet\"]\n",
    "        record = {\n",
    "            \"id\": f\"yt_{video['video_id']}_{comment_item['id']}\",\n",
    "            \"source_platform\": \"youtube\",\n",
    "            \"source_channel\": handle,\n",
    "            \"text_raw\": snippet[\"textDisplay\"],\n",
    "            \"video_id\": video[\"video_id\"],\n",
    "            \"video_title\": video[\"video_title\"],\n",
    "            \"video_date\": video[\"video_date\"],\n",
    "            \"comment_date\": snippet[\"publishedAt\"][:10],\n",
    "            \"likes\": snippet[\"likeCount\"],\n",
    "            \"collected_at\": collected_at\n",
    "        }\n",
    "        comments.append(record)\n",
    "\n",
    "# Save the raw records to data/raw/, as announced in the introduction\n",
    "output_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "with output_file.open(\"w\", encoding=\"utf-8\") as f:\n",
    "    for record in comments:\n",
    "        f.write(json.dumps(record, ensure_ascii=False) + \"\\n\")\n",
    "\n",
    "len(comments)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39e538d5",
   "metadata": {},
   "source": [
    "# Exploration and cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be59b927",
   "metadata": {},
   "source": [
    "## 7. Inspect the first comments\n",
    "Before saving the cleaned file, we check that the data looks as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "bd7a31f0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': 'yt_Fyk8Ob7CRjw_Ugwc6D97EMahyFzA7Xx4AaABAg',\n",
       "  'source_platform': 'youtube',\n",
       "  'source_channel': 'digi24hd56',\n",
       "  'text_raw': 'TV  DIGI  24   TV  A  SISTEMULUI   !!!!',\n",
       "  'video_id': 'Fyk8Ob7CRjw',\n",
       "  'video_title': '🟣 Știrile Digi24 de la ora 12 – 4 mai 2026',\n",
       "  'video_date': '2026-05-04',\n",
       "  'comment_date': '2026-05-04',\n",
       "  'likes': 0,\n",
       "  'collected_at': '2026-05-04'},\n",
       " {'id': 'yt_Fyk8Ob7CRjw_Ugz3uhxyQ_w2qF4MJdF4AaABAg',\n",
       "  'source_platform': 'youtube',\n",
       "  'source_channel': 'digi24hd56',\n",
       "  'text_raw': 'Auristule ați pierdut din start facind o coaliție cu PSD  va compromis o sa va invită psd',\n",
       "  'video_id': 'Fyk8Ob7CRjw',\n",
       "  'video_title': '🟣 Știrile Digi24 de la ora 12 – 4 mai 2026',\n",
       "  'video_date': '2026-05-04',\n",
       "  'comment_date': '2026-05-04',\n",
       "  'likes': 0,\n",
       "  'collected_at': '2026-05-04'}]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comments[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "92ebc328",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['id', 'source_platform', 'source_channel', 'text_raw', 'video_id', 'video_title', 'video_date', 'comment_date', 'likes', 'collected_at'])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "comments[0].keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2d0a632",
   "metadata": {},
   "source": [
    "## 8. Minimal text cleaning\n",
    "We now start from `text_raw` and build a cleaned version in the `text` field.\n",
    "We do not change the meaning of the comment. We only remove simple noise: links, redundant whitespace, texts that are too short, and duplicates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c58d4c59",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def clean_text(text):\n",
    "    text = re.sub(r\"http\\S+\", \"\", text)      # elimină linkuri\n",
    "    text = re.sub(r\"\\s+\", \" \", text)         # normalizează spațiile\n",
    "    return text.strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b82f5c21",
   "metadata": {},
   "source": [
    "## 9. Apply the cleaning\n",
    "For each comment we keep the original text in `text_raw` and add the cleaned text in `text`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "7db7f57b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': 'yt_Fyk8Ob7CRjw_Ugwc6D97EMahyFzA7Xx4AaABAg',\n",
       " 'source_platform': 'youtube',\n",
       " 'source_channel': 'digi24hd56',\n",
       " 'text_raw': 'TV  DIGI  24   TV  A  SISTEMULUI   !!!!',\n",
       " 'video_id': 'Fyk8Ob7CRjw',\n",
       " 'video_title': '🟣 Știrile Digi24 de la ora 12 – 4 mai 2026',\n",
       " 'video_date': '2026-05-04',\n",
       " 'comment_date': '2026-05-04',\n",
       " 'likes': 0,\n",
       " 'collected_at': '2026-05-04',\n",
       " 'text': 'TV DIGI 24 TV A SISTEMULUI !!!!'}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for comment in comments:\n",
    "    comment[\"text\"] = clean_text(comment[\"text_raw\"])\n",
    "\n",
    "comments[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6aa2e7ea",
   "metadata": {},
   "source": [
    "## 10. Filter out comments that are too short\n",
    "For this exercise we keep only comments that have at least 60 characters.\n",
    "Very short comments are hard to interpret in discourse analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "b38504cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Raw comments: 2\n",
      "Comments after length filtering: 1\n"
     ]
    }
   ],
   "source": [
    "MIN_CHARS = 60\n",
    "\n",
    "comments_clean = [\n",
    "    comment for comment in comments\n",
    "    if len(comment[\"text\"]) >= MIN_CHARS\n",
    "]\n",
    "\n",
    "print(\"Raw comments:\", len(comments))\n",
    "print(\"Comments after length filtering:\", len(comments_clean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddd65203",
   "metadata": {},
   "source": [
    "## 11. Filter out texts with too few letters\n",
    "Comments made up mostly of emoji, symbols, or isolated characters add noise.\n",
    "We keep comments in which at least 50% of the characters are letters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "cb2f4ff6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comments after letter filtering: 1\n"
     ]
    }
   ],
   "source": [
    "MIN_ALPHA = 0.5\n",
    "\n",
    "def alpha_ratio(text):\n",
    "    \"\"\"Return the share of characters in text that are letters.\"\"\"\n",
    "    if len(text) == 0:\n",
    "        return 0\n",
    "    letters = sum(char.isalpha() for char in text)\n",
    "    return letters / len(text)\n",
    "\n",
    "comments_clean = [\n",
    "    comment for comment in comments_clean\n",
    "    if alpha_ratio(comment[\"text\"]) >= MIN_ALPHA\n",
    "]\n",
    "\n",
    "print(\"Comments after letter filtering:\", len(comments_clean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ff7ffc6",
   "metadata": {},
   "source": [
    "## 12. Remove duplicates\n",
    "If the same text appears more than once (ignoring letter case), we keep it only once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "8a84753b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Final comments after deduplication: 1\n"
     ]
    }
   ],
   "source": [
    "seen_texts = set()\n",
    "unique_comments = []\n",
    "\n",
    "for comment in comments_clean:\n",
    "    text = comment[\"text\"].lower()  # case-insensitive comparison key\n",
    "    if text not in seen_texts:\n",
    "        unique_comments.append(comment)\n",
    "        seen_texts.add(text)\n",
    "\n",
    "comments_clean = unique_comments\n",
    "\n",
    "print(\"Final comments after deduplication:\", len(comments_clean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b7dafca",
   "metadata": {},
   "source": [
    "## 14. Save the cleaned file\n",
    "We save the result in `data/cleaned/`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a393f35c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaned comments saved: 1\n",
      "File: c:\\PROJECTS\\echochamber-app\\data\\cleaned\\student_01_youtube_clean.jsonl\n"
     ]
    }
   ],
   "source": [
    "clean_output_file = ROOT / \"data\" / \"cleaned\" / f\"{student_id}_youtube_clean.jsonl\"\n",
    "clean_output_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "with clean_output_file.open(\"w\", encoding=\"utf-8\") as f:\n",
    "    for comment in comments_clean:\n",
    "        f.write(json.dumps(comment, ensure_ascii=False) + \"\\n\")\n",
    "\n",
    "print(\"Cleaned comments saved:\", len(comments_clean))\n",
    "print(\"File:\", clean_output_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab0d7aa4",
   "metadata": {},
   "source": [
    "# The cleaning function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "03c42a30",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def clean_comments(comments, min_chars=60, min_alpha=0.5):\n",
    "    \"\"\"Clean, filter, and deduplicate comments without modifying the input records.\"\"\"\n",
    "    cleaned = []\n",
    "    seen_texts = set()\n",
    "\n",
    "    for comment in comments:\n",
    "        # 1. Clean the text: remove links, collapse whitespace\n",
    "        text = comment[\"text_raw\"]\n",
    "        text = re.sub(r\"http\\S+\", \"\", text)\n",
    "        text = re.sub(r\"\\s+\", \" \", text).strip()\n",
    "\n",
    "        # 2. Length filter\n",
    "        if len(text) < min_chars:\n",
    "            continue\n",
    "\n",
    "        # 3. Letter-ratio filter (the guard avoids division by zero when min_chars=0)\n",
    "        letters = sum(char.isalpha() for char in text)\n",
    "        letter_ratio = letters / len(text) if len(text) > 0 else 0\n",
    "\n",
    "        if letter_ratio < min_alpha:\n",
    "            continue\n",
    "\n",
    "        # 4. Duplicate filter (case-insensitive)\n",
    "        text_key = text.lower()\n",
    "        if text_key in seen_texts:\n",
    "            continue\n",
    "\n",
    "        seen_texts.add(text_key)\n",
    "\n",
    "        # 5. Keep the comment and add the cleaned text\n",
    "        new_comment = comment.copy()\n",
    "        new_comment[\"text\"] = text\n",
    "        new_comment[\"lang\"] = \"ro\"  # assumed Romanian; no language detection is performed\n",
    "        cleaned.append(new_comment)\n",
    "\n",
    "    return cleaned"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "878548e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Raw comments: 2\n",
      "Cleaned comments: 1\n"
     ]
    }
   ],
   "source": [
    "comments_clean = clean_comments(\n",
    "    comments,\n",
    "    min_chars=60,\n",
    "    min_alpha=0.5\n",
    ")\n",
    "\n",
    "print(\"Raw comments:\", len(comments))\n",
    "print(\"Cleaned comments:\", len(comments_clean))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "4c3f9fc8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RAW: Auristule ați pierdut din start facind o coaliție cu PSD  va compromis o sa va invită psd\n",
      "CLEAN: Auristule ați pierdut din start facind o coaliție cu PSD va compromis o sa va invită psd\n",
      "---\n"
     ]
    }
   ],
   "source": [
    "for comment in comments_clean[:3]:\n",
    "    print(\"RAW:\", comment[\"text_raw\"])\n",
    "    print(\"CLEAN:\", comment[\"text\"])\n",
    "    print(\"---\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccffd13f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Overwrite the cleaned file with the output of clean_comments (records now include \"lang\")\n",
    "clean_output_file = ROOT / \"data\" / \"cleaned\" / f\"{student_id}_youtube_clean.jsonl\"\n",
    "clean_output_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "with clean_output_file.open(\"w\", encoding=\"utf-8\") as f:\n",
    "    for comment in comments_clean:\n",
    "        f.write(json.dumps(comment, ensure_ascii=False) + \"\\n\")\n",
    "\n",
    "print(\"File saved:\", clean_output_file)\n",
    "print(\"Comments saved:\", len(comments_clean))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47e547ac",
   "metadata": {},
   "source": [
    "## 15. What we obtained\n",
    "We produced two files:\n",
    "- `data/raw/student_XX_youtube_raw.jsonl`: raw comments\n",
    "- `data/cleaned/student_XX_youtube_clean.jsonl`: cleaned comments\n",
    "\n",
    "The cleaned file can later be merged with the files of the other team members."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (.venv)",
   "language": "python",
   "name": "echochamber"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
