html - Scrape a page with JavaScript from R
I am new to web scraping in R and have run into a problem with sites that reference JavaScript. I am attempting to scrape the data from the web page below and have been unsuccessful. I believe the JavaScript links prevent me from accessing the table. As a result, the function "readHTMLTable" from the R package "XML" comes up NULL.
    library(XML)
    library(RCurl)

    url <- "http://votingrights.news21.com/interactive/movement-voter-id/index.html"
    tabs <- getURL(url)
    tabs <- htmlParse(url)
    tabs <- readHTMLTable(tabs, stringsAsFactors = FALSE)
How can I access the data behind the JavaScript links? Or is this even possible? When using the direct link to the data (below), the R package "rjson" is still unable to read in the data.
    library("rjson")

    json_file <- "http://votingrights.news21.com/static/interactives/movement/data/fulldata.js"
    lines <- readLines(json_file)
    json_data <- fromJSON(lines, collapse="")
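The file you reference is a JavaScript file containing JSON rather than pure JSON: the data is wrapped in a variable assignment, so the first line starts with "... = " and the last line ends with a semicolon. In this case you can manually scrub the contents to get at the data: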
    library("rjson")

    json_file <- "http://votingrights.news21.com/static/interactives/movement/data/fulldata.js"
    lines <- readLines(json_file)
    # strip the leading "<name> = " from the first line
    lines[1] <- sub(".* = (.*)", "\\1", lines[1])
    # strip the trailing semicolon from the last line
    lines[length(lines)] <- sub(";", "", lines[length(lines)])
    json_data <- fromJSON(paste(lines, collapse="\n"))

    > head(json_data[[1]][[1]])
    $state
    [1] "alabama"

    $bill
    [1] "hb 19"

    $category
    [1] "strict photo id"

    $introduced
    [1] "mar 1, 2011"

    $house
    [1] "yes"

    $senate
    [1] "yes"
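To see what the two sub() calls do without touching the network, here is a minimal base-R sketch. The two-line sample and the variable name in it are made up for illustration and are only assumed to resemble the real file:

```r
# Hypothetical stand-in for readLines(json_file): a JavaScript
# assignment wrapping a JSON array (sample contents are invented).
lines <- c('var fulldata = [{"state": "alabama",',
           ' "bill": "hb 19"}];')

# Keep only what follows " = " on the first line.
lines[1] <- sub(".* = (.*)", "\\1", lines[1])

# Remove the semicolon that terminates the assignment on the last line.
lines[length(lines)] <- sub(";", "", lines[length(lines)])

# What remains is plain JSON text, ready for a JSON parser.
json_text <- paste(lines, collapse = "\n")
cat(json_text, "\n")
```

The resulting json_text could then be handed to fromJSON() as in the answer above. Note that sub() only replaces the first semicolon it finds, so this relies on the last line containing no semicolon other than the terminating one.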
If you want to interact with the JavaScript data on the web page, you can use Selenium:
    library(RSelenium)

    appurl <- "http://votingrights.news21.com/static/interactives/movement/index.html"
    # start PhantomJS (it must be installed and on the PATH)
    pjs <- phantom()
    remdr <- remoteDriver(browserName = "phantom")
    remdr$open()
    remdr$navigate(appurl)
    # run JavaScript in the page and return the variable that holds the data
    fulldata <- remdr$executeScript("return fulldata;")
    pjs$stop()

    > head(fulldata[[1]][[1]])
    $state
    [1] "alabama"

    $bill
    [1] "hb 19"

    $category
    [1] "strict photo id"

    $introduced
    [1] "mar 1, 2011"

    $house
    [1] "yes"

    $senate
    [1] "yes"