Using Sparklyr Package in R
In this tutorial, I will give an overview of how to use sparklyr in the RStudio environment. sparklyr is an R package that provides a dplyr backend for Apache Spark: it exposes almost all of the familiar dplyr verbs, such as select, filter, arrange, mutate, transmute, and summarize, against Spark data frames. sparklyr also supports SQL queries through the DBI package.
It also has support for distributed machine learning through Spark MLlib and H2O, which we will not discuss in this article.
In this tutorial, I will use a small dataset to carry out some basic functions of sparklyr. Download the dataset from here.
First, we need to install and load sparklyr and all the required packages.
install.packages(c("sparklyr", "devtools", "dplyr", "xlsx"))
library("sparklyr")
library("dplyr")
library("devtools")
library("xlsx")
# Getting updated sparklyr package
devtools::install_github("rstudio/sparklyr")
Next, we install Spark and connect to it.
spark_install(version = "2.1.0")
sc <- spark_connect(master = "local")
Here, master can point to a remote cluster instead; we are running Spark locally on our machine.
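As a sketch, connecting to a standalone cluster only changes the master argument (the cluster URL below is hypothetical; substitute your own):

```r
library(sparklyr)

# Hypothetical standalone cluster URL; replace with your cluster's master
sc_remote <- spark_connect(master = "spark://spark-master.example.com:7077",
                           version = "2.1.0")
```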
Next, we read the data directly into Spark. If you are using RStudio, you will see a separate tab listing the Spark tables.
data_temp <- spark_read_csv(sc, name = "spark_data", path = "data.csv", header = TRUE, delimiter = ",")
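Because spark_read_csv registered the data as a table named "spark_data", we can also query it with SQL through DBI. A minimal sketch:

```r
library(DBI)

# Run a SQL query against the registered Spark table;
# the result comes back as a local R data frame
result <- dbGetQuery(sc, "SELECT * FROM spark_data LIMIT 5")
```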
You will notice that data_temp refers to a Spark dataset here, so you will not be able to use it like a local R data frame. Some functions still work, such as colnames() to get the column names and head() to see the first six rows.
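For example:

```r
colnames(data_temp)  # column names of the Spark table
head(data_temp)      # first six rows, fetched from Spark
```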
Now to do all the data manipulation, we will use dplyr functions.
# Removing all -1 pdays records
data_temp <- data_temp %>% filter(pdays != -1)
To rename a column, we can use the following.
# Changing column name 'y' to 'outcome'
data_temp <- data_temp %>% rename(outcome = y)
Our dataset has many columns with binary values, but they are stored as yes and no; we will convert them to Boolean TRUE and FALSE. First, we create a helper function to check which columns are binary.
# A column is binary if it has exactly two distinct non-NA values
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
# Replacing yes and no with True and False, respectively
data_temp <- mutate_if(data_temp, is.binary, funs(ifelse(. == "yes", TRUE, FALSE)))
To split a column into separate columns, we can use the ft_regex_tokenizer() function. It splits a column based on a regex pattern; here the pattern matches everything from the first white space onward, so only the first word of name is kept.
data_temp <- ft_regex_tokenizer(data_temp, input.col = "name", output.col = "FName", pattern = ' .*$')
To combine two columns, such as the day and month in our dataset:
data_temp <- mutate(data_temp, date = paste( day , month , "2017", sep = "/" ))
Extracting the age column for ages over 30, sorted in ascending order.
# Using select, filter and arrange
data_temp %>% select(age) %>% filter(age > 30) %>% arrange(age)
We can now write directly to CSV from Spark.
spark_write_csv(data_temp, path="mydata_csv")
We can use `ggplot2` to visualize the data, but first we need to collect the data from Spark into a local R data frame using the collect() function.
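A minimal sketch, assuming we want to look at the distribution of the age column used above:

```r
library(ggplot2)

# Bring the Spark data into a local R data frame
local_df <- collect(data_temp)

# Plot the distribution of ages
ggplot(local_df, aes(x = age)) +
  geom_histogram(binwidth = 5)
```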
You can find the full code on GitHub.
These were a few basic functions and their usage. You may run into issues when using them; in particular, mutate() does not always work with a sparklyr data frame, because the package is still under active development. For questions, you can ask on the sparklyr GitHub issue tracker.