sfarrow is designed to help read/write spatial vector data in "simple feature" format from/to Parquet files while maintaining coordinate reference system information. Essentially, this tool connects R objects in sf and arrow, and it relies on those packages for its internal work.
A key goal is to support interoperability of spatial data in Parquet files. R objects (including sf) can be written to files with arrow; however, those files do not necessarily preserve the spatial information, nor can they necessarily be read by Python. sfarrow implements a metadata format also used by Python GeoPandas, described here: https://github.com/geopandas/geo-arrow-spec. Note that this metadata specification is not yet stable, and sfarrow will warn you that it may change.
# install from CRAN with install.packages('sfarrow')
# or install from GitHub with devtools::install_github("wcjochem/sfarrow@main")
# load the library
library(sfarrow)
library(dplyr, warn.conflicts = FALSE)
A Parquet file (with a .parquet extension) can be read with st_read_parquet() by pointing it to a path on the file system. This creates an sf spatial data object in memory, which can then be used as normal with functions from sf.
# read an example dataset created from Python using geopandas
world <- st_read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"))
class(world)
#> [1] "sf" "data.frame"
world
#> Simple feature collection with 177 features and 5 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
#> Geodetic CRS: WGS 84
#> First 10 features:
#> pop_est continent name iso_a3 gdp_md_est
#> 1 920938 Oceania Fiji FJI 8.374e+03
#> 2 53950935 Africa Tanzania TZA 1.506e+05
#> 3 603253 Africa W. Sahara ESH 9.065e+02
#> 4 35623680 North America Canada CAN 1.674e+06
#> 5 326625791 North America United States of America USA 1.856e+07
#> 6 18556698 Asia Kazakhstan KAZ 4.607e+05
#> 7 29748859 Asia Uzbekistan UZB 2.023e+05
#> 8 6909701 Oceania Papua New Guinea PNG 2.802e+04
#> 9 260580739 Asia Indonesia IDN 3.028e+06
#> 10 44293293 South America Argentina ARG 8.794e+05
#> geometry
#> 1 MULTIPOLYGON (((180 -16.067...
#> 2 POLYGON ((33.90371 -0.95, 3...
#> 3 POLYGON ((-8.66559 27.65643...
#> 4 MULTIPOLYGON (((-122.84 49,...
#> 5 MULTIPOLYGON (((-122.84 49,...
#> 6 POLYGON ((87.35997 49.21498...
#> 7 POLYGON ((55.96819 41.30864...
#> 8 MULTIPOLYGON (((141.0002 -2...
#> 9 MULTIPOLYGON (((141.0002 -2...
#> 10 MULTIPOLYGON (((-68.63401 -...
plot(sf::st_geometry(world))
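Under the hood, the spatial information travels as a "geo" key in the Parquet file's metadata. As a quick way to peek at it (a hedged sketch, assuming the key is carried through as Arrow schema metadata when the file is read with arrow; this is not part of the sfarrow API):
# read the file as an Arrow Table (not a data frame) to inspect its metadata;
# the "geo" key (if present) holds JSON describing geometry columns and CRS
tab <- arrow::read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"),
                           as_data_frame = FALSE)
cat(tab$metadata$geo)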
Similarly, a Parquet file can be written from an sf object using st_write_parquet(), specifying a path to the new file. Non-spatial objects cannot be written with sfarrow; users should instead use arrow.
# output the file to a new location
# note the warning about possible future changes in metadata.
st_write_parquet(world, dsn = file.path(tempdir(), "new_world.parquet"))
#> Warning: This is an initial implementation of Parquet/Feather file support and
#> geo metadata. This is tracking version 0.1.0 of the metadata
#> (https://github.com/geopandas/geo-arrow-spec). This metadata
#> specification may change and does not yet make stability promises. We
#> do not yet recommend using this in a production setting unless you are
#> able to rewrite your Parquet/Feather files.
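As a quick sanity check (not part of the original example), the new file can be read back to confirm the spatial information survived the round trip:
# read the newly written file back and compare its CRS with the original
new_world <- st_read_parquet(file.path(tempdir(), "new_world.parquet"))
sf::st_crs(new_world) == sf::st_crs(world)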
While reading/writing a single Parquet file is nice, the real power of arrow comes from splitting big datasets into multiple files, or partitions, based on criteria that make them faster to query. There is currently basic support in sfarrow for multi-file spatial datasets. For additional dataset querying options, see the arrow documentation. sfarrow uses arrow's dplyr interface to explore partitioned Arrow Datasets.
For this example we will use a dataset that was created by randomly splitting the nc.shp file into three groups and then further partitioning each of those into two more random groups. This creates a nested set of files.
list.files(system.file("extdata", "ds", package = "sfarrow"), recursive = TRUE)
#> [1] "split1=1/split2=1/part-3.parquet" "split1=1/split2=2/part-0.parquet"
#> [3] "split1=2/split2=1/part-1.parquet" "split1=2/split2=2/part-5.parquet"
#> [5] "split1=3/split2=1/part-2.parquet" "split1=3/split2=2/part-4.parquet"
The file tree shows that the data were partitioned by the variables "split1" and "split2", which are the column names used for the random splits. This partitioning is in "Hive style", where the partitioning variables appear in the folder paths.
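For reference, a dataset with this layout could be produced with group_by() and write_sf_dataset() (covered in more detail below). This is only a hedged sketch of the idea, not the exact code used to build the bundled files; the random group assignments here are illustrative:
# sketch: randomly assign each county to the two splitting variables,
# then write a Hive-style partitioned dataset (split1=*/split2=*/...)
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc$split1 <- sample(1:3, nrow(nc), replace = TRUE)
nc$split2 <- sample(1:2, nrow(nc), replace = TRUE)
nc %>%
  group_by(split1, split2) %>%
  write_sf_dataset(file.path(tempdir(), "nc_ds"), format = "parquet")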
The first step is to open the Dataset using arrow.
ds <- arrow::open_dataset(system.file("extdata", "ds", package="sfarrow"))
For small datasets (as in this example) we can read the entire set of files into an sf object.
nc_ds <- read_sf_dataset(ds)
nc_ds
#> Simple feature collection with 100 features and 16 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
#> 2 0.062 1.547 1834 1834 Camden 37029 37029 15 286 0
#> 3 0.109 1.325 1841 1841 Person 37145 37145 73 1556 4
#> 4 0.081 1.288 1880 1880 Watauga 37189 37189 95 1323 1
#> 5 0.044 1.158 1887 1887 Chowan 37041 37041 21 751 1
#> 6 0.086 1.267 1893 1893 Yadkin 37197 37197 99 1269 1
#> 7 0.170 1.680 1903 1903 Guilford 37081 37081 41 16184 23
#> 8 0.118 1.601 1946 1946 Madison 37115 37115 58 765 2
#> 9 0.134 1.755 1958 1958 Burke 37023 37023 12 3573 5
#> 10 0.116 1.664 1964 1964 McDowell 37111 37111 56 1946 5
#> NWBIR74 BIR79 SID79 NWBIR79 split1 split2 geometry
#> 1 954 1838 5 1237 1 1 MULTIPOLYGON (((-76.74506 3...
#> 2 115 350 2 139 1 1 MULTIPOLYGON (((-76.00897 3...
#> 3 613 1790 4 650 1 1 MULTIPOLYGON (((-78.8068 36...
#> 4 17 1775 1 33 1 1 MULTIPOLYGON (((-81.80622 3...
#> 5 368 899 1 491 1 1 MULTIPOLYGON (((-76.68874 3...
#> 6 65 1568 1 76 1 1 MULTIPOLYGON (((-80.49554 3...
#> 7 5483 20543 38 7089 1 1 MULTIPOLYGON (((-79.53782 3...
#> 8 5 926 2 3 1 1 MULTIPOLYGON (((-82.89597 3...
#> 9 326 4314 15 407 1 1 MULTIPOLYGON (((-81.81628 3...
#> 10 134 2215 5 128 1 1 MULTIPOLYGON (((-81.81628 3...
With large datasets, more often we will want to query them and return a reduced set of the partitioned records. The easiest way to create a query is to use dplyr::filter() on the partitioning (and/or other) variables to subset the rows, and dplyr::select() to subset the columns. read_sf_dataset() will then take the arrow_dplyr_query, call dplyr::collect() to extract the records, and process the resulting Arrow Table into sf.
nc_d12 <- ds %>%
filter(split1 == 1, split2 == 2) %>%
read_sf_dataset()
nc_d12
#> Simple feature collection with 20 features and 16 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -83.36472 ymin: 34.71101 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
#> 2 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
#> 3 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
#> 4 0.118 1.421 1836 1836 Warren 37185 37185 93 968 4
#> 5 0.114 1.352 1838 1838 Caswell 37033 37033 17 1035 2
#> 6 0.143 1.663 1840 1840 Granville 37077 37077 39 1671 4
#> 7 0.108 1.483 1900 1900 Forsyth 37067 37067 34 11858 10
#> 8 0.111 1.392 1904 1904 Alamance 37001 37001 1 4672 13
#> 9 0.104 1.294 1907 1907 Orange 37135 37135 68 3164 4
#> 10 0.122 1.516 1932 1932 Caldwell 37027 37027 14 3609 6
#> NWBIR74 BIR79 SID79 NWBIR79 split1 split2 geometry
#> 1 10 1364 0 19 1 2 MULTIPOLYGON (((-81.47276 3...
#> 2 208 3616 6 260 1 2 MULTIPOLYGON (((-80.45634 3...
#> 3 1066 1606 3 1197 1 2 MULTIPOLYGON (((-77.21767 3...
#> 4 748 1190 2 844 1 2 MULTIPOLYGON (((-78.30876 3...
#> 5 550 1253 2 597 1 2 MULTIPOLYGON (((-79.53051 3...
#> 6 930 2074 4 1058 1 2 MULTIPOLYGON (((-78.74912 3...
#> 7 3919 15704 18 5031 1 2 MULTIPOLYGON (((-80.0381 36...
#> 8 1243 5767 11 1397 1 2 MULTIPOLYGON (((-79.24619 3...
#> 9 776 4478 6 1086 1 2 MULTIPOLYGON (((-79.01814 3...
#> 10 309 4249 9 360 1 2 MULTIPOLYGON (((-81.32813 3...
plot(sf::st_geometry(nc_d12), col="grey")
When using select() to read only a subset of columns, if the geometry column is not returned, the default behaviour of sfarrow is to throw an error from read_sf_dataset(). If you do not need the geometry column for your analyses, then using arrow and not sfarrow should be sufficient. However, setting find_geom = TRUE in read_sf_dataset() will read in any geometry columns listed in the metadata, in addition to the selected columns.
# this command will throw an error
# no geometry column selected for read_sf_dataset
# nc_sub <- ds %>%
# select('FIPS') %>% # subset of columns
# read_sf_dataset()
# set find_geom
nc_sub <- ds %>%
select('FIPS') %>% # subset of columns
read_sf_dataset(find_geom = TRUE)
nc_sub
#> Simple feature collection with 100 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
#> First 10 features:
#> FIPS geometry
#> 1 37091 MULTIPOLYGON (((-76.74506 3...
#> 2 37029 MULTIPOLYGON (((-76.00897 3...
#> 3 37145 MULTIPOLYGON (((-78.8068 36...
#> 4 37189 MULTIPOLYGON (((-81.80622 3...
#> 5 37041 MULTIPOLYGON (((-76.68874 3...
#> 6 37197 MULTIPOLYGON (((-80.49554 3...
#> 7 37081 MULTIPOLYGON (((-79.53782 3...
#> 8 37115 MULTIPOLYGON (((-82.89597 3...
#> 9 37023 MULTIPOLYGON (((-81.81628 3...
#> 10 37111 MULTIPOLYGON (((-81.81628 3...
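These pieces can also be combined in a single query; for example (a small usage sketch, not from the original vignette), filtering rows on the partitioning variables while selecting a subset of columns:
# filter rows by partition and keep only FIPS plus the geometry column
nc_sub12 <- ds %>%
  filter(split1 == 1, split2 == 2) %>%
  select('FIPS') %>%
  read_sf_dataset(find_geom = TRUE)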
To write an sf object into multiple files, we can again construct a query, this time using dplyr::group_by() to define the partitioning variables. The result is then passed to write_sf_dataset().
world %>%
group_by(continent) %>%
write_sf_dataset(file.path(tempdir(), "world_ds"),
format = "parquet",
hive_style = FALSE)
#> Warning: This is an initial implementation of Parquet/Feather file support and
#> geo metadata. This is tracking version 0.1.0 of the metadata
#> (https://github.com/geopandas/geo-arrow-spec). This metadata
#> specification may change and does not yet make stability promises. We
#> do not yet recommend using this in a production setting unless you are
#> able to rewrite your Parquet/Feather files.
In this example we are not using Hive style, so the partitioning variable name does not appear in the folder paths.
list.files(file.path(tempdir(), "world_ds"))
#> [1] "Africa" "Antarctica"
#> [3] "Asia" "Europe"
#> [5] "North America" "Oceania"
#> [7] "Seven seas (open ocean)" "South America"
To read this style of Dataset, we must specify the partitioning variables when it is opened.
arrow::open_dataset(file.path(tempdir(), "world_ds"),
partitioning = "continent") %>%
filter(continent == "Africa") %>%
read_sf_dataset()
#> Simple feature collection with 51 features and 5 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: -17.62504 ymin: -34.81917 xmax: 51.13387 ymax: 37.34999
#> Geodetic CRS: WGS 84
#> First 10 features:
#> pop_est name iso_a3 gdp_md_est continent
#> 1 53950935 Tanzania TZA 150600.0 Africa
#> 2 603253 W. Sahara ESH 906.5 Africa
#> 3 83301151 Dem. Rep. Congo COD 66010.0 Africa
#> 4 7531386 Somalia SOM 4719.0 Africa
#> 5 47615739 Kenya KEN 152700.0 Africa
#> 6 37345935 Sudan SDN 176300.0 Africa
#> 7 12075985 Chad TCD 30590.0 Africa
#> 8 54841552 South Africa ZAF 739100.0 Africa
#> 9 1958042 Lesotho LSO 6019.0 Africa
#> 10 13805084 Zimbabwe ZWE 28330.0 Africa
#> geometry
#> 1 POLYGON ((33.90371 -0.95, 3...
#> 2 POLYGON ((-8.66559 27.65643...
#> 3 POLYGON ((29.34 -4.499983, ...
#> 4 POLYGON ((41.58513 -1.68325...
#> 5 POLYGON ((39.20222 -4.67677...
#> 6 POLYGON ((24.56737 8.229188...
#> 7 POLYGON ((23.83766 19.58047...
#> 8 POLYGON ((16.34498 -28.5767...
#> 9 POLYGON ((28.97826 -28.9556...
#> 10 POLYGON ((31.19141 -22.2515...