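When practicing web scraping, a scraping package can stop working for many reasons. The example in "R Text Mining - Web Crawling" is one such case: Bloomberg changed how its pages are built, which caused a network security version mismatch in RCurl, so RCurl could no longer fetch the data. This post introduces rvest, another R scraping package, as an alternative. It also updates the demo from "R Text Mining - Web Crawling" so you can compare how the two packages handle the same task.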
The rvest package
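Using rvest is very similar to using RCurl + XML. Call read_html() to read the page, then call html_nodes() to select the elements you want to extract. To find the right selector, open the page in Chrome, right-click the data you want, and choose Inspect. This shows the class of the enclosing div, and that class name (prefixed with a dot) is the selector for the elements you want. Once the nodes are selected, call html_text() to extract their text.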
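The three steps above can be tried without touching a live site. The following is a minimal sketch, assuming rvest is installed; the inline HTML and the class name "headline" are made up for illustration and do not come from any real page.

```r
library(rvest)

# Hypothetical inline page standing in for a real website (assumption).
page <- read_html('
<html><body>
  <div class="headline"> First story </div>
  <div class="headline"> Second story </div>
  <div class="other">Not a headline</div>
</body></html>')

# Step 2: select every element whose class is "headline" (CSS selector).
nodes <- html_nodes(page, ".headline")

# Step 3: extract the text and strip the surrounding whitespace.
titles <- trimws(html_text(nodes))
print(titles)
```

Only the two div elements with the matching class are returned; the third div is ignored because its class differs.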
R implementation
As before, this example scrapes the headlines of Bloomberg's markets news.
library(rvest)

# Read the page, select the headline links by class, and extract their text.
html.data <- read_html("https://www.bloomberg.com/markets")
html.path <- html_nodes(html.data, ".story-package-module__story__headline-link")
text.data <- html_text(html.path)

# After cleaning up text.data, we get the news headlines we want.
gsub('\n ', '', text.data)

 [1] " China Dethroned by Japan as World's Second-Biggest Stock Market"
 [2] " Eurozone Economy Enters Third Quarter With No Pickup in Sight"
 [3] " Carney Jumps Back Into Brexit Debate With No-Deal Risk Warning"
 [4] " Italy Yield Tops 3% Before Budget Meeting as Bonds Extend Slump"
 [5] " What Economists Are Saying Ahead of Friday's U.S. Jobs Report"
 [6] " A Trillion-Dollar Apple Isn't Like Other Tech Giants When It Comes to Valuation"
 [7] " William Hill's U.K. Troubles Make U.S. Expansion More Urgent"
 [8] " RBS Lays Out Plan to Resume Dividends a Decade After Bailout"
 [9] " Heineken Takes On AB InBev in China With $3 Billion Deal"
[10] " Eskom Unions Are Said Likely to Accept New Bonus Pay Offer"
[11] " Mondi of South Africa Jumps as Consumers Shun Plastic for Paper"
[12] " Jesus Is Staying at Manchester City Longer Than Expected"
[13] " Gold Rout Takes Prices Near $1,200 as Investors Favor Dollar"
[14] " Corporate Bonds are Finally Waking Up to Britain's Brexit Risks"
[15] " Kotak's `Interesting Move' May Remove Overhang From Bank Shares"
[16] " Current Bout of Rand Volatility May Blow Over, Options Suggest"
[17] " Cooling Vests for Hot Dogs Help Lead Pets at Home Shares Higher"
[18] " Bitcoin Needs to Hit $213,000 to Replace Money Supply, UBS Says"
[19] " Brokers’ Cryptocurrency Deals Are Focus of SEC Review"
[20] " Crypto Bulls Pile Into ICOs at Record Pace Despite Bitcoin Rout"
[21] " Long Blockchain Gets Hit With SEC Subpoena After Nasdaq Ouster"
[22] " Lira Pares Decline as Inflation Accelerates Less Than Forecast"
[23] " Swiss Re Profit Disappoints as Pricing Pressure Hurts Reinsurers"
[24] " Monte Paschi Keeps Up Profit Momentum Despite Hit on State Bonds"
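The headlines printed above still begin with a space, so a slightly more thorough cleanup may be useful. The following is a sketch using only base R; the strings in raw are hypothetical stand-ins for what html_text() might return, since the exact whitespace on the live page is not known here.

```r
# Hypothetical raw strings with newlines and padding (assumption).
raw <- c("\n      Gold Rout Takes Prices Near $1,200\n    ",
         "  Lira Pares   Decline ")

# Collapse every run of whitespace into one space, then trim both ends.
clean <- trimws(gsub("\\s+", " ", raw))
print(clean)
```

Passing trim = TRUE to html_text() should also remove leading and trailing whitespace at extraction time, though it does not collapse whitespace inside the string.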
Conclusion
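When scraping with R, you will run into situations like this one. If the problem lies in the package itself, it is worth looking for other packages: very often a package with similar functionality can solve the same problem. After all, the richness of its package ecosystem is what R is most praised for.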