Web crawling: requests, BeautifulSoup, and examples

WEB 2021. 5. 3. 09:48

Using requests¶

In [17]:

# requests is a library for requesting page data from a server
import requests as req  # 'as' gives the library a shorter alias

In [18]:

naver_url = "http://www.naver.com"
res = req.get(naver_url)
# <Response [200]> means the request succeeded and the data came back.
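The number inside the response is the HTTP status code. As a small illustration, the standard library can translate a code into its standard reason phrase. The helper name `describe_status` is made up for this note, and treating only 2xx codes as success is a simplification.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard reason phrase, and a rough success flag."""
    status = HTTPStatus(code)
    # Simplification for this note: only 2xx codes count as success.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(406))
```

With a real response object, the same number is available as `res.status_code`.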

In [19]:

res

Out[19]:

<Response [200]>

In [20]:

# View the requested page's content as text (first 1,000 characters)
res.text[:1000]

Out[20]:

'\n<!doctype html>                 <html lang="ko" data-dark="false"> <head> <meta charset="utf-8"> <title>NAVER</title> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=1190"> <meta name="apple-mobile-web-app-title" content="NAVER"/> <meta name="robots" content="index,nofollow"/> <meta name="description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta property="og:title" content="네이버"> <meta property="og:url" content="https://www.naver.com/"> <meta property="og:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta property="og:description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta name="twitter:card" content="summary"> <meta name="twitter:title" content=""> <meta name="twitter:url" content="https://www.naver.com/"> <meta name="twitter:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta name="twitter:description" content="네이버 메인에서 다양한 정'

(Exercise) Fetching the Melon home page¶

In [21]:

melon_url = "https://www.melon.com/"
info_melon = req.get(melon_url)
info_melon
# Status 406 (Not Acceptable) >> the server refused the request.

Out[21]:

<Response [406]>

In [22]:

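# Send a browser-like User-Agent so the request looks like it comes from a person's browser rather than a script; servers use this header to tell the two apart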
h = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}

In [23]:

# headers >> send the header values defined above along with the request
info_melon = req.get(melon_url, headers=h)  # headers is the fixed keyword name; only the value after = changes
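To see why the header matters, it helps to compare what requests sends by default with the browser-like value set above. The sketch below builds the request without sending it, so it needs no network access. That Melon rejects the default value is an inference from the 406 response above, not something the site documents.

```python
import requests

# The User-Agent requests uses when none is supplied ("python-requests/<version>").
default_ua = requests.utils.default_user_agent()
print(default_ua)

h = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}

# Build the request without sending it, then inspect the header it would carry.
prepared = requests.Request("GET", "https://www.melon.com/", headers=h).prepare()
print(prepared.headers["User-Agent"])
```

The first line printed identifies the client as a Python script, which is probably what the server reacted to; the second is the value the server sees once `headers=h` is passed.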

In [24]:

info_melon.text[:1000]

Out[24]:

'<!DOCTYPE html>\r\n<html lang="ko">\r\n<head>\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\r\n\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\r\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\r\n\t\r\n\r\n\t\r\n\r\n\t\r\n\r\n\t<title>Melon::음악이 필요한 순간, 멜론</title>\r\n\t<meta name="keywords" content="음악서비스, 멜론차트, 멜론TOP100, 최신음악, 인기가요, 뮤직비디오, 앨범, 플레이어, 스트리밍, 다운로드, 아티스트플러스, 아티스트채널" />\r\n\t<meta name="description" content="국내 최다 4,000만곡 보유, No.1 뮤직플랫폼 멜론! 최신 트렌드부터 나를 아는 똑똑한 음악추천까지!" />\r\n\t<meta name="naver-site-verification" content="e2b43191afa0f1d2deb8e2cda8f45ee1408c44a1"/>\r\n\t<meta property="fb:app_id" content="357952407588971"/>\r\n\t<meta property="og:title" content="Melon"/>\r\n\t<meta property="og:image" content="https://cdnimg.melon.co.kr/resource/image/web/common/logo_melon142x99.png"/>\r\n\t<meta property="og:description" content="음악이 필요한 순간, 멜론"/>\r\n\t<meta property="og:url" content="http://www.melon.com/"/>\r\n\t<meta property="og:type" content="website"/>\r\n\t<meta name="viewport" content="width=device-'

(Exercise) Extracting the text "블로그" (Blog) from Naver¶

In [25]:

res = req.get(naver_url)

In [26]:

res.text[:1000]

Out[26]:

'\n<!doctype html>                 <html lang="ko" data-dark="false"> <head> <meta charset="utf-8"> <title>NAVER</title> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=1190"> <meta name="apple-mobile-web-app-title" content="NAVER"/> <meta name="robots" content="index,nofollow"/> <meta name="description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta property="og:title" content="네이버"> <meta property="og:url" content="https://www.naver.com/"> <meta property="og:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta property="og:description" content="네이버 메인에서 다양한 정보와 유용한 컨텐츠를 만나 보세요"/> <meta name="twitter:card" content="summary"> <meta name="twitter:title" content=""> <meta name="twitter:url" content="https://www.naver.com/"> <meta name="twitter:image" content="https://s.pstatic.net/static/www/mobile/edit/2016/0705/mobile_212852414260.png"> <meta name="twitter:description" content="네이버 메인에서 다양한 정'

In [27]:

naver_info = res.text
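The fetch-then-take-text step repeats in every exercise, so it can be wrapped in a small helper. This is only a sketch: `fetch_html`, its `timeout` default, and the injectable `get` parameter are assumptions made for this note, not part of requests itself.

```python
import requests


def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Return the page body as text, raising on 4xx/5xx responses."""
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for error statuses such as 406.
    response.raise_for_status()
    return response.text
```

Usage would look like `naver_info = fetch_html(naver_url)` or `fetch_html(melon_url, headers=h)`. The `get` parameter exists only so the function can be tried with a stand-in instead of a real network call.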

BeautifulSoup¶

-- Used to extract only the content you want from the fetched data

In [28]:

# BeautifulSoup lives in the bs4 package, so it is imported with from ... import
from bs4 import BeautifulSoup as bs

In [29]:

# bs(what to parse, how to parse it: the parser to use)
naver_list = bs(naver_info,'lxml')

In [30]:

# find_all: returns every tag in the parsed document that matches the given tag name and class
result = naver_list.find_all('a', class_="nav")  # class is a reserved word in Python, so the keyword is class_

In [31]:

result[2]

Out[31]:

<a class="nav" data-clk="svc.blog" href="https://section.blog.naver.com/">블로그</a>

In [32]:

# Output only the plain text of the retrieved element
result[2].text

Out[32]:

'블로그'

In [33]:

for link in result:
    print(link.text)

메일
카페
블로그
지식iN
쇼핑
Pay
TV
사전
뉴스
증권
부동산
지도
영화
VIBE
책
웹툰
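Naver's markup changes over time, so the `nav` class may no longer match on the live page. The same find_all pattern can be tried offline on a small hand-written snippet. The HTML below is invented for illustration (only the blog link is taken from the output above), and it uses Python's built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the example.com link is a placeholder.
sample_html = """
<div>
  <a class="nav" href="https://example.com/mail">메일</a>
  <a class="nav" href="https://section.blog.naver.com/">블로그</a>
  <a class="other" href="https://example.com/">ignored</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = soup.find_all("a", class_="nav")  # only the two anchors with class "nav"

texts = [a.text for a in links]     # visible text of each link
hrefs = [a["href"] for a in links]  # attribute access works like a dict

for text, href in zip(texts, hrefs):
    print(text, href)
```

Indexing a tag with `["href"]` reads an attribute, while `.text` returns the text between the opening and closing tags.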

(Exercise) Searching for '코로나' (corona) and extracting only the news titles as text¶

In [34]:

covid_search_url = "https://search.naver.com/search.naver?sm=top_hty&fbm=0&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98"
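The `%EC%BD%94...` part of the URL is simply the search word '코로나' percent-encoded as UTF-8. The sketch below decodes it with the standard library and builds a URL for another keyword. `build_search_url` is a hypothetical helper, and whether Naver accepts a URL carrying only the `query` parameter is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

covid_search_url = "https://search.naver.com/search.naver?sm=top_hty&fbm=0&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98"

# Decode the query string back into readable parameters.
params = parse_qs(urlsplit(covid_search_url).query)
print(params["query"])


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword and append it as the query parameter."""
    return "https://search.naver.com/search.naver?" + urlencode({"query": keyword})


print(build_search_url("코로나"))
```

requests can do the same encoding itself when the keyword is passed through the `params` argument of `req.get`.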

In [35]:

covid_info = req.get(covid_search_url)
covid_info  # display the stored response instead of sending a second request

Out[35]:

<Response [200]>

In [36]:

covid_text = covid_info.text

In [45]:

# Parse the HTML with BeautifulSoup using the lxml parser
covid_list = bs(covid_text,'lxml')

In [46]:

result_covid = covid_list.find_all('a', class_ = 'news_tit')
result_covid

Out[46]:

[<a class="news_tit" href="http://yna.kr/AKR20210503014951007?did=1195m" onclick="return goOtherCR(this, 'a=nws_all*a.tit&amp;r=1&amp;i=880000D8_000000000000000012368852&amp;g=001.0012368852&amp;u='+urlencode(this.href));" target="_blank" title="FC서울 황현수 코로나19 확진…선수단도 검사받고 대기(종합)">FC서울 황현수 <mark>코로나</mark>19 확진…선수단도 검사받고 대기(종합)</a>,
 <a class="news_tit" href="https://news.imaeil.com/Society/2021050309334773911" onclick="return goOtherCR(this, 'a=nws_all*h.tit&amp;r=4&amp;i=880000C1_000000000000000000701110&amp;g=088.0000701110&amp;u='+urlencode(this.href));" target="_blank" title="[속보] 코로나19 어제 488명 신규확진, 1주일만에 500명 아래">[속보] <mark>코로나</mark>19 어제 488명 신규확진, 1주일만에 500명 아래</a>,
 <a class="news_tit" href="http://www.hani.co.kr/arti/society/health/993598.html" onclick="return goOtherCR(this, 'a=nws_all*a.tit&amp;r=6&amp;i=88000103_000000000000000002542904&amp;g=028.0002542904&amp;u='+urlencode(this.href));" target="_blank" title="[속보] 코로나19 신규 확진자 488명…일주일 만에 400명대">[속보] <mark>코로나</mark>19 신규 확진자 488명…일주일 만에 400명대</a>,
 <a class="news_tit" href="https://news.sbs.co.kr/news/endPage.do?news_id=N1006305068&amp;plink=ORI&amp;cooper=NAVER" onclick="return goOtherCR(this, 'a=nws_all*a.tit&amp;r=8&amp;i=8800011C_000000000000000000891694&amp;g=055.0000891694&amp;u='+urlencode(this.href));" target="_blank" title="[속보] 코로나19 어제 488명 신규 확진, 1주일 만에 500명 아래…휴일 영향">[속보] <mark>코로나</mark>19 어제 488명 신규 확진, 1주일 만에 500명 아래…휴일 영향</a>]

In [47]:

for title in result_covid:
    print(title.text)

FC서울 황현수 코로나19 확진…선수단도 검사받고 대기(종합)
[속보] 코로나19 어제 488명 신규확진, 1주일만에 500명 아래
[속보] 코로나19 신규 확진자 488명…일주일 만에 400명대
[속보] 코로나19 어제 488명 신규 확진, 1주일 만에 500명 아래…휴일 영향
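Each `news_tit` anchor also carries the article link and a `title` attribute, so the title and URL can be collected together. The sketch below runs on one anchor copied from the output above (with the `onclick` and `target` attributes dropped for brevity) and uses "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# One anchor from the search results above, shortened for readability.
sample_html = (
    '<a class="news_tit" href="https://news.imaeil.com/Society/2021050309334773911" '
    'title="[속보] 코로나19 어제 488명 신규확진, 1주일만에 500명 아래">'
    '[속보] <mark>코로나</mark>19 어제 488명 신규확진, 1주일만에 500명 아래</a>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# .get() reads an attribute; .text joins the visible text and drops the <mark> tags.
articles = [
    {"title": a.get("title"), "url": a.get("href"), "text": a.text}
    for a in soup.find_all("a", class_="news_tit")
]

for article in articles:
    print(article["title"], article["url"])
```

In this sample the `title` attribute and `.text` give the same string, because `.text` concatenates the text around and inside the `<mark>` highlight.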
