[Python] Web Scraping, Web Crawling, BeautifulSoup, parser

A place where I write up what I have studied.

Some of the content may be wrong.

All corrections and comments are welcome.

Today's code

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <div data-role="page" data-last-modified="2022-01-01" data-foo="value">This is a div with data attributes.</div>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2" data-info="more info">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3" data-info="even more info">Tillie abcd</a>
    ; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

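print(soup.title)         # the whole <title> tag
print(soup.title.name)    # the tag name
print(soup.title.string)  # the string inside the tag
print(soup.title.text)    # all text inside the tag
print(soup.p)             # the first <p> tag
print(soup.p['class'])    # class attribute of the first <p>, as a list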

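Web scraping

The process of automatically extracting data from websites

Pulls the needed information out of a web page's HTML and processes it into a useful form

Done here with the BeautifulSoup library

Things to keep in mind

1. Check the website's rules (such as robots.txt)

2. Avoid excessive requests

3. Comply with copyright and the terms of service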
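The first two points can be partly automated. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the bot name study-bot, and the URLs are all made up for illustration, and nothing is actually downloaded.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "study-bot"  # hypothetical bot name
urls = [
    "http://example.com/elsie",
    "http://example.com/private/notes",
]

# Fall back to one second when the site states no crawl delay.
delay = robots.crawl_delay(USER_AGENT) or 1

# Keep only the URLs that robots.txt allows for this bot.
allowed = [url for url in urls if robots.can_fetch(USER_AGENT, url)]

for url in allowed:
    print("would fetch:", url)
    time.sleep(delay)  # pause between requests to avoid hammering the server
```

Only the first URL passes the check, because /private/ is disallowed for every user agent in this sample file.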

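Web crawling

The process of automatically navigating many web pages on the internet to collect data

Moves through websites according to set rules, fetches data, and analyzes it for follow-up work

Mainly used to build search engine indexes, for search engine optimization (SEO) analysis, or as base data for other web services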
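To make "moving through websites according to set rules" concrete, here is a small sketch of a breadth-first crawl. It is only an illustration: the PAGES dictionary stands in for a real site (no network access), and the crawl function and its max_pages parameter are names I made up.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "site": URL -> HTML. A real crawler would download each page.
PAGES = {
    "http://example.com/": '<a href="/elsie">Elsie</a> <a href="/lacie">Lacie</a>',
    "http://example.com/elsie": '<a href="/">Home</a> <a href="/tillie">Tillie</a>',
    "http://example.com/lacie": '<a href="http://other.example/">Elsewhere</a>',
    "http://example.com/tillie": "<p>No links here.</p>",
}


def crawl(start, max_pages=10):
    """Breadth-first crawl that only follows links to pages listed in PAGES."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        soup = BeautifulSoup(PAGES[url], "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])  # resolve relative links
            # Rule: stay on known pages and never visit a URL twice.
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("http://example.com/")
print(visited)
```

The seen set is what keeps the crawler from looping when pages link back to each other, and max_pages puts a hard limit on how far it goes.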

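BeautifulSoup

A library that parses HTML and XML documents and makes it easy to extract the data you want from a web page

Turns the messy, irregular data of the web into a structured form

HTML/XML parsing : builds a tree from the document so that specific elements are easy to find and extract

Navigation : search and filter by tag, attribute, text, and other elements

Modification : change or delete elements in the document

CSS selector support : find elements with CSS selectors through .select(); XPath is not supported by BeautifulSoup itself (use lxml directly for XPath)

Used in the form BeautifulSoup(HTML document, 'parser')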
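The features listed above can be tried on a small document. This is a rough sketch rather than a complete tour. The HTML is a shortened, made-up variant of the sample document, with an extra span added so there is something to delete.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <span class="ad">Buy now</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigation: filter by tag name and attribute.
names = [a.text for a in soup.find_all("a", class_="sister")]

# CSS selector: the same kind of lookup through .select().
hrefs = [a["href"] for a in soup.select("p.story > a")]

# Modification: change one element's text and delete another element.
soup.find(id="link1").string = "Elsie (edited)"
soup.find("span", class_="ad").decompose()

print(names)
print(hrefs)
print(soup.find(id="link1").text)
print(soup.find("span"))  # None, because the span was removed
```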

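Parser

A tool that analyzes the given input and converts it into structured data

Options include html.parser, lxml, and html5lib

html.parser ships with Python; lxml and html5lib must be installed before use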
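Since lxml and html5lib may not be installed, one option is to try them first and fall back to html.parser. The sketch below assumes that approach. The helper make_soup and its preferred parameter are my own names, not part of BeautifulSoup. FeatureNotFound is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is available")


# The closing </p> is missing on purpose; every parser repairs it.
soup, used = make_soup("<p>Hello")
print(used)
print(soup.p.text)
```

Different parsers can build slightly different trees from broken HTML (for example, lxml and html5lib add html and body tags), so it is worth using the same parser everywhere in one project.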

.title

The <title> tag of the HTML document

Given <title>Hello</title>, returns <title>Hello</title>

.title.name

The tag name of the HTML document's <title> tag

Given <title>Hello</title>, returns title

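.title.string, .title.text

The string inside the HTML document's <title> tag

Given <title>Hello</title>, both return Hello

They differ when a tag holds more than one child: .string returns None, while .text joins all the text inside the tag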

.p

The first <p> tag in the HTML document

Given <p>Hello</p> <p>Goodbye</p>, returns <p>Hello</p>
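A short check of the "first tag only" behavior, on a made-up two-paragraph document. Because soup.p gives back a Tag, the same .string and .text accessors work on it. The second paragraph is there to show the case where the two give different results.

```python
from bs4 import BeautifulSoup

html = "<p>Hello</p><p>Good <b>bye</b></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # shortcut for the first <p>
same = soup.find("p")       # explicit form of the same lookup
all_p = soup.find_all("p")  # every <p>, in document order

print(first)
print(first == same)
print(len(all_p))

second = all_p[1]
print(second.string)  # None: this <p> has two children (text and <b>)
print(second.text)    # all text inside the tag joined together
```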

.p['class']

The class attribute value of the first <p> tag in the HTML document

Returns the class values as a list, because class can hold several values
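The list return value is easier to see next to other attributes. In this sketch the tag and its attribute values are made up. class comes back as a list, while id and data-info come back as plain strings.

```python
from bs4 import BeautifulSoup

html = '<p class="title main" id="intro" data-info="more info">Hello</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["class"])      # multi-valued attribute, returned as a list
print(p["id"])         # single-valued attribute, returned as a string
print(p["data-info"])  # data-* attributes are read the same way
print(p.get("href"))   # .get() returns None for a missing attribute
print(p.attrs)         # all attributes as a dictionary
```

p["href"] would raise KeyError here, so .get() is the safer choice when an attribute may be missing.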

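Today's code: output

Today's code: explanation

soup = BeautifulSoup(html_doc, 'html.parser')

Parses the document html_doc with html.parser and stores the result in soup

soup is then used for crawling and data extraction by navigating and modifying its tags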

print(soup.title)

Prints the title tag of soup

>soup.title=<title>The Dormouse's story</title>

print(soup.title.name)
print(soup.title.string)
print(soup.title.text)

Prints the name and the string of soup's title tag

>soup.title.name=title

>soup.title.string=The Dormouse's story

>soup.title.text=The Dormouse's story

print(soup.p)

Prints the first p tag of soup

>soup.p=<p class="title"><b>The Dormouse's story</b></p>

print(soup.p['class'])

Prints the class attribute value of soup's first p tag

>soup.p['class']=['title']
