Title: | Extract Data Tables and Comments from 'Microsoft' 'Word' Documents |
---|---|
Description: | 'Microsoft Word' 'docx' files provide an 'XML' structure that is fairly straightforward to navigate, especially when it applies to 'Word' tables and comments. Tools are provided to determine table count/structure, comment count and also to extract/clean tables and comments from 'Microsoft Word' 'docx' documents. There is also nascent support for '.doc' files. |
Authors: | Bob Rudis [aut, cre] , Mark Dulhunty [ctb], Karlo Guidoni-Martins [ctb], Chris Muir [aut, ctb] |
Maintainer: | Bob Rudis <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.6.5 |
Built: | 2025-01-05 02:46:37 UTC |
Source: | https://gitlab.com/hrbrmstr/docxtractr |
Many tables in Word documents are in twisted formats where there may be labels or other oddities mixed in that make it difficult to work with the underlying data. This function makes it easy to identify a particular row in a scraped data.frame as the one containing column names and have it become the column names, removing it and (optionally) all of the rows before it (since that's usually what needs to be done).
assign_colnames(dat, row, remove = TRUE, remove_previous = remove)
dat | can be any data.frame |
row | numeric value indicating the row number that is to become the column names |
remove | remove row specified by row? (default: TRUE) |
remove_previous | remove any rows preceding row? (default: same value as remove) |
Returns a data.frame.
See also: docx_extract_all, docx_extract_tbl
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)

# get all the tables
tbls <- docx_extract_all_tbls(real_world)

# make table 1 better
assign_colnames(tbls[[1]], 2)

# make table 5 better
assign_colnames(tbls[[5]], 2)
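To make the row-promotion idea concrete without needing a Word file, here is a minimal base-R sketch of the behaviour described above. The function name promote_row_sketch and the sample data are hypothetical, and the body is an assumption about the semantics (the chosen row becomes the column names; when remove is TRUE that row is dropped, along with the rows before it when remove_previous is TRUE). It is not the package's own implementation.

```r
# Hypothetical stand-in for illustration only; not docxtractr's code.
promote_row_sketch <- function(dat, row, remove = TRUE, remove_previous = remove) {
  # Leave the input untouched if the requested row does not exist.
  if (row < 1 || row > nrow(dat)) return(dat)

  # Use the values of the chosen row as the new column names.
  colnames(dat) <- as.character(unlist(dat[row, ], use.names = FALSE))

  if (remove) {
    # Drop the header row, plus everything above it if requested.
    first <- if (remove_previous) 1 else row
    dat <- dat[-(first:row), , drop = FALSE]
  }

  rownames(dat) <- NULL
  dat
}

# A made-up "scraped" table: a title line, then the real header, then data.
raw <- data.frame(
  V1 = c("Quarterly report", "Region", "North", "South"),
  V2 = c("", "Sales", "10", "12"),
  stringsAsFactors = FALSE
)

tidy <- promote_row_sketch(raw, 2)
print(tidy)
```

With the sample data, row 2 supplies the names and rows 1 and 2 are removed, leaving only the two data rows.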
Convert a Document (usually PowerPoint) to a PDF
convert_to_pdf(path, pdf_file = sub("[.]pptx", ".pdf", path))
path | path to the document, can be PowerPoint or DOCX |
pdf_file | output PDF file name. By default, creates a PDF in the same directory as the path file |
## Not run:
path = system.file("examples/ex.pptx", package="docxtractr")
pdf <- convert_to_pdf(path, pdf_file = tempfile(fileext = ".pdf"))

path = system.file("examples/data.docx", package="docxtractr")
pdf_doc <- convert_to_pdf(path, pdf_file = tempfile(fileext = ".pdf"))
## End(Not run)
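The default for pdf_file in the usage line is the expression sub("[.]pptx", ".pdf", path). The base-R sketch below evaluates that expression on two made-up paths to show what it yields; it does not call convert_to_pdf, and the helper name default_pdf_name is hypothetical.

```r
# Hypothetical helper that only evaluates the default expression from the usage line.
default_pdf_name <- function(path) sub("[.]pptx", ".pdf", path)

# A PowerPoint path has its extension swapped for .pdf.
pptx_out <- default_pdf_name("slides/ex.pptx")

# A DOCX path contains no ".pptx", so the expression returns it unchanged.
docx_out <- default_pdf_name("reports/data.docx")

print(pptx_out)
print(docx_out)
```

Because the default expression leaves a .docx path as it is, supplying pdf_file explicitly for non-PowerPoint input, as the examples above do, looks like the safer choice.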
Get number of comments in a Word document
docx_cmnt_count(docx)
docx | a docx object read with read_docx |
Returns a numeric value.
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
Returns information about the comments in the Word document
docx_describe_cmnts(docx)
docx | a docx object read with read_docx |
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
docx_describe_cmnts(cmnts)
This function will attempt to discern the structure of each of the tables in docx and print this information.
docx_describe_tbls(docx)
docx | a docx object read with read_docx |
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
docx_tbl_count(complx)
docx_describe_tbls(complx)
Extract all tables from a Word document
docx_extract_all(docx, guess_header = TRUE, preserve = FALSE, trim = TRUE)
docx | a docx object read with read_docx |
guess_header | should the function make a guess as to the existence of a header in a table? (default: TRUE) |
preserve | preserve line breaks within a cell? (default: FALSE) NOTE: This overrides trim. |
trim | trim leading/trailing whitespace (if any) in cells? (default: TRUE) |
Returns a list of data.frames, or an empty list if no tables exist in docx.
See also: assign_colnames, docx_extract_tbl
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)

# get all the tables
tbls <- docx_extract_all_tbls(real_world)
Extract all comments from a Word document
docx_extract_all_cmnts(docx, include_text = FALSE)
docx | a docx object read with read_docx |
include_text |
if |
Returns a data_frame of comment id, author & text.
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
docx_describe_cmnts(cmnts)
docx_extract_all_cmnts(cmnts)
Extract all tables from a Word document
docx_extract_all_tbls(docx, guess_header = TRUE, preserve = FALSE, trim = TRUE)
docx | a docx object read with read_docx |
guess_header | should the function make a guess as to the existence of a header in a table? (default: TRUE) |
preserve | preserve line breaks within a cell? (default: FALSE) NOTE: This overrides trim. |
trim | trim leading/trailing whitespace (if any) in cells? (default: TRUE) |
Returns a list of data.frames, or an empty list if no tables exist in docx.
See also: assign_colnames, docx_extract_tbl
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)

# get all the tables
tbls <- docx_extract_all_tbls(real_world)
Given a document read with read_docx and a table to extract (optionally indicating whether there was a header or not and whether cell whitespace trimming is desired), extract the contents of the table to a data.frame.
docx_extract_tbl(docx, tbl_number = 1, header = TRUE, preserve = FALSE, trim = TRUE)
docx | a docx object read with read_docx |
tbl_number | which table to extract (defaults to 1) |
header | assume first row of table is a header row? (default: TRUE) |
preserve | preserve line breaks within a cell? (default: FALSE) |
trim | trim leading/trailing whitespace (if any) in cells? (default: TRUE) |
Returns a data.frame.
See also: docx_extract_all, docx_extract_tbl, assign_colnames
doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))
docx_extract_tbl(doc3, 3)

intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
docx_extract_tbl(intracell_whitespace, 2, preserve=FALSE)
docx_extract_tbl(intracell_whitespace, 2, preserve=TRUE)
Get number of tables in a Word document
docx_tbl_count(docx)
docx | a docx object read with read_docx |
Returns a numeric value.
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
docx_tbl_count(complx)
Microsoft Word 'docx' files provide an XML structure that is fairly straightforward to navigate, especially when it applies to Word tables. The 'docxtractr' package provides tools to determine table count and table structure and to extract tables from Microsoft Word 'docx' documents. It also provides tools to determine comment count and extract comments from Word 'docx' documents.
Bob Rudis ([email protected])
Replace punctuation and spaces in column names with underscores and convert the names to lower case.
mcga(tbl)
tbl | a data.frame-like object |
Returns whatever class tbl was but with truly great, really great column names. They're amazing. Trust me. They'll be incredible column names once we're done.
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
tbls <- docx_extract_all_tbls(real_world)
mcga(assign_colnames(tbls[[1]], 2))
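As a rough illustration of the kind of clean-up described for mcga, the base-R sketch below lower-cases column names and collapses runs of punctuation and whitespace into single underscores. The function name clean_names_sketch and the sample column names are hypothetical, and stripping leading or trailing underscores is an assumption added here; the package's actual rules may differ.

```r
# Hypothetical stand-in for illustration only; not the package's mcga().
clean_names_sketch <- function(tbl) {
  nms <- tolower(colnames(tbl))

  # Collapse each run of punctuation and/or whitespace into one underscore.
  nms <- gsub("[[:punct:][:space:]]+", "_", nms)

  # Assumption: drop underscores left dangling at either end of a name.
  nms <- gsub("^_+|_+$", "", nms)

  colnames(tbl) <- nms
  tbl
}

# Made-up table with awkward column names.
df <- data.frame(a = 1:2, b = 3:4)
colnames(df) <- c("Total Cost ($)", "Region/Area")

cleaned <- clean_names_sketch(df)
print(colnames(cleaned))
```

The object keeps its class and data; only the column names change.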
Display information about the document
## S3 method for class 'docx'
print(x, ...)
x | a docx object |
... | ignored |
Local file path or URL pointing to a .docx file. Can also take a .doc file as input if LibreOffice is installed (see https://www.libreoffice.org/ for more info and to download).
read_docx(path, track_changes = NULL)
path | path to the Word document |
track_changes |
if not |
doc <- read_docx(system.file("examples/data.docx", package="docxtractr"))
class(doc)

doc <- read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "accept"
)

## Not run:
# from a URL
budget <- read_docx("http://rud.is/dl/1.DOCX")
## End(Not run)
Function to set an option that points to the local LibreOffice file soffice.exe.
set_libreoffice_path(path)
path | path to the LibreOffice soffice file |
For a list of possible file path locations for soffice.exe, see https://github.com/hrbrmstr/docxtractr/issues/5#issuecomment-233181976
Returns nothing; the function sets the option variable path_to_libreoffice.
## Not run:
set_libreoffice_path("local/path/to/soffice.exe")
## End(Not run)
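Since the function is described as setting the option variable path_to_libreoffice, the base-R sketch below shows how such an option can be set and read back with options() and getOption(). This only illustrates R's option mechanism; it is an assumption that the result is comparable to what set_libreoffice_path does, and the path is the same placeholder used in the example above.

```r
# Set the option directly (placeholder path), keeping the previous value.
old <- options(path_to_libreoffice = "local/path/to/soffice.exe")

# Read the option back, as code relying on it could do.
current <- getOption("path_to_libreoffice")
print(current)

# Restore whatever was set before, so the session is left unchanged.
options(old)
```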