Preface
This article takes a look at the YoutubeDocumentReader in Spring AI Alibaba.
YoutubeDocumentReader
community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/main/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReader.java
public class YoutubeDocumentReader implements DocumentReader {

    private static final String WATCH_URL = "https://www.youtube.com/watch?v=%s";

    private final ObjectMapper objectMapper;

    private static final List<String> YOUTUBE_URL_PATTERNS = List.of("youtube\\.com/watch\\?v=([^&]+)",
            "youtu\\.be/([^?&]+)");

    private final String resourcePath;

    private static final int MEMORY_SIZE = 5;

    private static final int BYTE_SIZE = 1024;

    private static final int MAX_MEMORY_SIZE = MEMORY_SIZE * BYTE_SIZE * BYTE_SIZE;

    private static final WebClient WEB_CLIENT = WebClient.builder()
        .defaultHeader("Accept-Language", "en-US")
        .codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(MAX_MEMORY_SIZE))
        .build();

    public YoutubeDocumentReader(String resourcePath) {
        Assert.hasText(resourcePath, "Query string must not be empty");
        this.resourcePath = resourcePath;
        this.objectMapper = new ObjectMapper();
    }

    @Override
    public List<Document> get() {
        List<Document> documents = new ArrayList<>();
        try {
            String videoId = extractVideoIdFromUrl(resourcePath);
            String subtitleContent = getSubtitleInfo(videoId);
            documents.add(new Document(StringEscapeUtils.unescapeHtml4(subtitleContent)));
        }
        catch (IOException e) {
            throw new RuntimeException("Failed to load document from Youtube: {}", e);
        }
        return documents;
    }

    // Method to extract the videoId from the resourcePath
    public String extractVideoIdFromUrl(String resourcePath) {
        for (String pattern : YOUTUBE_URL_PATTERNS) {
            Pattern regexPattern = Pattern.compile(pattern);
            Matcher matcher = regexPattern.matcher(resourcePath);
            if (matcher.find()) {
                return matcher.group(1); // Extract the videoId (captured group)
            }
        }
        throw new IllegalArgumentException("Invalid YouTube URL: Unable to extract videoId.");
    }

    public String getSubtitleInfo(String videoId) throws IOException {
        // Step 1: Fetch the HTML content of the YouTube video page
        String url = String.format(WATCH_URL, videoId);
        // Blocking for simplicity in this example
        String htmlContent = fetchHtmlContent(url).block();

        // Step 2: Extract the subtitle tracks from the HTML
        String captionsJsonString = extractCaptionsJson(htmlContent);
        if (captionsJsonString != null) {
            JsonNode captionsJson = objectMapper.readTree(captionsJsonString);
            JsonNode captionTracks = captionsJson.path("playerCaptionsTracklistRenderer").path("captionTracks");

            // Check if captionTracks exists and is an array
            if (captionTracks.isArray()) {
                // Step 3: Extract and decode each subtitle track's URL
                StringBuilder subtitleInfo = new StringBuilder();
                JsonNode captionTrack = captionTracks.get(0);
                // Safely access languageCode and baseUrl with null checks
                String language = captionTrack.path("languageCode").asText("Unknown");
                String urlEncoded = captionTrack.path("baseUrl").asText("");

                // Decode the URL to avoid \u0026 issues
                String decodedUrl = URLDecoder.decode(urlEncoded, StandardCharsets.UTF_8);

                String subtitleText = fetchSubtitleText(decodedUrl);
                subtitleInfo.append("Language: ").append(language).append("\n").append(subtitleText).append("\n\n");
                return subtitleInfo.toString();
            }
            else {
                return "No captions available.";
            }
        }
        else {
            return "No captions data found.";
        }
    }

    private Mono<String> fetchHtmlContent(String url) {
        // Use WebClient to fetch HTML content asynchronously
        return WEB_CLIENT.get().uri(url).retrieve().bodyToMono(String.class);
    }

    private String extractCaptionsJson(String htmlContent) {
        // Extract the captions JSON from the HTML content
        String marker = "\"captions\":";
        int startIndex = htmlContent.indexOf(marker);
        if (startIndex != -1) {
            int endIndex = htmlContent.indexOf("\"videoDetails", startIndex);
            if (endIndex != -1) {
                String captionsJsonString = htmlContent.substring(startIndex + marker.length(), endIndex);
                return captionsJsonString.trim();
            }
        }
        return null;
    }

    private String fetchSubtitleText(String decodedUrl) throws IOException {
        // Fetch the subtitle text by making a request to the decoded subtitle URL
        org.jsoup.nodes.Document doc = Jsoup.connect(decodedUrl).get();

        // Assuming the subtitle text is inside <transcript> tags, extract the text
        StringBuilder subtitleText = new StringBuilder();
        doc.select("text").forEach(textNode -> {
            String text = textNode.text();
            subtitleText.append(text).append("\n");
        });
        return subtitleText.toString();
    }

}
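The YoutubeDocumentReader constructor requires a resourcePath (the video URL), and the class holds a static WebClient. Its get method first calls extractVideoIdFromUrl to obtain the videoId, then calls getSubtitleInfo to fetch the subtitles, and finally wraps the result in a List<Document>. getSubtitleInfo requests https://www.youtube.com/watch?v=videoId, slices the captions JSON out of the returned HTML (the text between the "captions": marker and the "videoDetails marker), parses it with Jackson, reads the languageCode and baseUrl of the first caption track only, and then downloads the subtitle text from that baseUrl with Jsoup.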
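The slicing step in extractCaptionsJson is plain string searching, so it can be tried without any network access. The following is a minimal sketch, not the library's code: the class name CaptionsSliceSketch, the method name sliceCaptionsJson, and the HTML fragment are all made up for illustration; only the two marker strings are taken from the reader.

```java
// Hypothetical standalone class; only the two marker strings come from YoutubeDocumentReader.
final class CaptionsSliceSketch {

    private static final String START_MARKER = "\"captions\":";

    private static final String END_MARKER = "\"videoDetails";

    // Returns the raw text between the two markers, or null if either marker is missing.
    static String sliceCaptionsJson(String html) {
        int start = html.indexOf(START_MARKER);
        if (start == -1) {
            return null;
        }
        int end = html.indexOf(END_MARKER, start);
        if (end == -1) {
            return null;
        }
        return html.substring(start + START_MARKER.length(), end).trim();
    }

    public static void main(String[] args) {
        // Made-up fragment that only imitates the shape of the embedded player data.
        String html = "var x = {\"captions\":{\"playerCaptionsTracklistRenderer\":"
                + "{\"captionTracks\":[]}},\"videoDetails\":{}};";
        // prints {"playerCaptionsTracklistRenderer":{"captionTracks":[]}},
        System.out.println(sliceCaptionsJson(html));
    }

}
```

Note that the slice ends with the comma that separated the two fields. The reader passes this string straight to ObjectMapper.readTree, so it relies on the parser stopping after the first complete JSON value; it also relies on "videoDetails" directly following "captions" in the page, which is an assumption about YouTube's page layout rather than a documented contract.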
Example
community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/test/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReaderTest.java
public class YoutubeDocumentReaderTest {

    private static final Logger logger = LoggerFactory.getLogger(YoutubeDocumentReaderTest.class);

    @Test
    void youtubeDocumentReaderTest() {
        YoutubeDocumentReader youtubeDocumentReader = new YoutubeDocumentReader(
                "https://www.youtube.com/watch?v=q-9wxg9tQRk");
        List<Document> documents = youtubeDocumentReader.get();
        logger.info("documents: {}", documents);
    }

}
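The test above needs network access to YouTube. The URL-parsing step can be checked offline; the sketch below copies the two regular expressions from YoutubeDocumentReader into a standalone class. The class name VideoIdSketch, the Optional return type, and the precompiled patterns are choices made for this illustration, not part of the library (the reader itself compiles the patterns on each call and throws IllegalArgumentException when nothing matches).

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone class; only the two regex strings come from YoutubeDocumentReader.
final class VideoIdSketch {

    // Compiled once here; the reader compiles them inside the loop on every call.
    private static final List<Pattern> PATTERNS = List.of(
            Pattern.compile("youtube\\.com/watch\\?v=([^&]+)"),
            Pattern.compile("youtu\\.be/([^?&]+)"));

    // Returns the first captured video id, or an empty Optional if no pattern matches.
    static Optional<String> extractVideoId(String url) {
        for (Pattern pattern : PATTERNS) {
            Matcher matcher = pattern.matcher(url);
            if (matcher.find()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints Optional[q-9wxg9tQRk]
        System.out.println(extractVideoId("https://www.youtube.com/watch?v=q-9wxg9tQRk&t=10s"));
        // prints Optional[q-9wxg9tQRk]
        System.out.println(extractVideoId("https://youtu.be/q-9wxg9tQRk?t=10"));
        // prints Optional.empty, because v is not the first query parameter
        System.out.println(extractVideoId("https://www.youtube.com/watch?feature=share&v=q-9wxg9tQRk"));
    }

}
```

The last case shows a limit of the first pattern: it only matches when v immediately follows the question mark, so a URL with another query parameter in front of v is rejected by these patterns.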
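Summary
spring-ai-alibaba-starter-document-reader-youtube provides YoutubeDocumentReader. It uses a WebClient to request the watch page for the given URL, extracts the language and text of the first caption track, and returns the result as a List<Document>.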
doc
- java2ai