R Text Mining - Web Scraping (2)

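When practicing web scraping, a scraping package can stop working for many reasons. In the example from the earlier post "R Text Mining - Web Scraping", Bloomberg changed how its pages are written, which led to a network security version mismatch in RCurl, so RCurl could no longer fetch the data. This post introduces another R scraping package, rvest, as an alternative, and redoes the demonstration from "R Text Mining - Web Scraping" so you can see how the two packages are alike and how they differ.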

The rvest package

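Using rvest is much like using RCurl + XML: read_html() reads the page, and html_nodes() selects the elements you want to extract. To find the right selector, I suggest opening the page in Chrome, right-clicking the data you want, and choosing Inspect. You can then see the class of the enclosing div; that class name, written with a leading dot as a CSS selector, is what you pass to html_nodes(). Once the elements are selected, html_text() extracts their text.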
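The three steps can be tried without touching a live site. The following is a minimal sketch, assuming rvest is installed; the HTML string, the class name headline-link, and the variable names are made up for illustration and do not come from any real page.

```r
library(rvest)

# Hypothetical HTML snippet standing in for a downloaded page.
sample.html <- '<html><body>
  <div class="headline"><a class="headline-link" href="#">First story</a></div>
  <div class="headline"><a class="headline-link" href="#">Second story</a></div>
  <div class="footer"><a class="other-link" href="#">About</a></div>
</body></html>'

# Step 1: parse the HTML (read_html() also accepts a URL or a file path).
sample.page <- read_html(sample.html)

# Step 2: select elements by CSS class; the leading dot means "class".
sample.nodes <- html_nodes(sample.page, ".headline-link")

# Step 3: extract the text of the selected elements.
sample.titles <- html_text(sample.nodes)
print(sample.titles)
```

Only the two links carrying the headline-link class are selected; the footer link is left out because its class differs.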

R implementation

As before, the demonstration scrapes the headlines of Bloomberg's market news.

library(rvest)

# Read the page, select the headline links by CSS class, then extract their text.
html.data <- read_html("https://www.bloomberg.com/markets")
html.path <- html_nodes(html.data, ".story-package-module__story__headline-link")
text.data <- html_text(html.path)

# After tidying up text.data, we get the news headlines we want.
gsub('\n            ', '', text.data)

 [1] "    China Dethroned by Japan as World's Second-Biggest Stock Market"                
 [2] "    Eurozone Economy Enters Third Quarter With No Pickup in Sight"                  
 [3] "    Carney Jumps Back Into Brexit Debate With No-Deal Risk Warning"                 
 [4] "    Italy Yield Tops 3% Before Budget Meeting as Bonds Extend Slump"                
 [5] "    What Economists Are Saying Ahead of Friday's U.S. Jobs Report"                  
 [6] "    A Trillion-Dollar Apple Isn't Like Other Tech Giants When It Comes to Valuation"
 [7] "    William Hill's U.K. Troubles Make U.S. Expansion More Urgent"                   
 [8] "    RBS Lays Out Plan to Resume Dividends a Decade After Bailout"                   
 [9] "    Heineken Takes On AB InBev in China With $3 Billion Deal"                       
[10] "    Eskom Unions Are Said Likely to Accept New Bonus Pay Offer"                     
[11] "    Mondi of South Africa Jumps as Consumers Shun Plastic for Paper"                
[12] "    Jesus Is Staying at Manchester City Longer Than Expected"                       
[13] "    Gold Rout Takes Prices Near $1,200 as Investors Favor Dollar"                   
[14] "    Corporate Bonds are Finally Waking Up to Britain's Brexit Risks"                
[15] "    Kotak's `Interesting Move' May Remove Overhang From Bank Shares"                
[16] "    Current Bout of Rand Volatility May Blow Over, Options Suggest"                 
[17] "    Cooling Vests for Hot Dogs Help Lead Pets at Home Shares Higher"                
[18] "    Bitcoin Needs to Hit $213,000 to Replace Money Supply, UBS Says"                
[19] "    Brokers’ Cryptocurrency Deals Are Focus of SEC Review"                          
[20] "    Crypto Bulls Pile Into ICOs at Record Pace Despite Bitcoin Rout"                
[21] "    Long Blockchain Gets Hit With SEC Subpoena After Nasdaq Ouster"                 
[22] "    Lira Pares Decline as Inflation Accelerates Less Than Forecast"                 
[23] "    Swiss Re Profit Disappoints as Pricing Pressure Hurts Reinsurers"               
[24] "    Monte Paschi Keeps Up Profit Momentum Despite Hit on State Bonds"
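The headlines above still carry four leading spaces, because the gsub() pattern only removes a newline followed by exactly twelve spaces. A minimal sketch of a more general cleanup with base R's trimws() follows; the two sample strings are invented stand-ins for text.data, not real scraped output.

```r
# Hypothetical raw strings imitating the whitespace around scraped headlines.
raw.titles <- c("\n                Sample headline one\n    ",
                "\n                Sample headline two")

# trimws() strips leading and trailing spaces, tabs, and newlines.
clean.titles <- trimws(raw.titles)
print(clean.titles)
```

html_text() also has a trim argument, so html_text(html.path, trim = TRUE) should give a similar result in one step.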

Conclusion

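Problems like this come up when scraping with R. If the cause lies in the package itself, it is worth looking for other packages: very often a package with similar functionality can solve the same problem, since the richness of its package ecosystem is one of the things R is most praised for.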
