8 min read

Cheat Sheet: three ways to use dplyr to retrieve data from a Microsoft SQL Database

Motivation

I use R to extract data held in Microsoft SQL Server databases on a daily basis.

When I first started I was confused by all the different ways to accomplish this task. I was a bit overwhelmed trying to choose the, “best,” option given the specific job at hand.

I want to share what approaches I’ve landed on to help others who may want a simple list of options to get started with.

Scope

This post is about reading data from a database, not writing to one.

I prefer to use packages in the tidyverse so I’ll focus on those packages.

While it’s possible to generalize many of the concepts I write about here to other DBMS systems I will focus exclusively on Microsoft SQL Server. I hope this will provide simple, prescriptive guidance for those working in a similar configuration.

The data for these examples is stored using Microsoft SQL Server Express. Free download available here.

One last thing - these are a few options I populated my toolbox with. They have served me well over the past two years as an analyst in an enterprise environment, but are definitely not the only options available.

Setup

Connect to the server

I use the keyring package to keep my credentials out of my R code. You can use the great documentation available from RStudio to learn how do the same.

user_name <-
  keyring::key_list("dbi_test")$username

con <- DBI::dbConnect(
      odbc::odbc(),
      Driver = "SQL Server",
      Server = "testsqlserver",
      UID = user_name,
      PWD = keyring::key_get("dbi_test", user_name)
   )

Write some sample data

Note that I set the temporary argument to TRUE so that the data is written to the tempdb on SQL server, which will result in it being deleted on disconnection.

This results in dplyr prefixing the table name with, “##.”

SOURCE: https://db.rstudio.com/dplyr/#connecting-to-the-database

dplyr::copy_to(con, nycflights13::flights, "flights",
  temporary = TRUE
) 
## Created a temporary table named: ##flights
dplyr::copy_to(con, nycflights13::airlines, "airlines",
  temporary = TRUE
) 
## Created a temporary table named: ##airlines
dplyr::copy_to(con, nycflights13::planes, "planes",
  temporary = TRUE
) 
## Created a temporary table named: ##planes

Option 1: Use dplyr syntax and let dbplyr handle the rest

When I use this option

This is my default option.

I do almost all of my analysis in R and this avoids fragmenting my work and thoughts across different tools.

Examples

Example 1: filter rows, and retrieve selected columns

dplyr::tbl(con, "##flights") %>%
  dplyr::filter(origin == "JFK", dest == "BQN" ) %>% 
  dplyr::select(flight, tailnum, dep_time, sched_dep_time,
                dep_delay) %>% 
  # I will remove the following line if exploring the data and
  # not interested in actually retrieving all records
  dplyr::collect() 
## # A tibble: 599 x 5
##    flight tailnum dep_time sched_dep_time dep_delay
##     <int> <chr>      <int>          <int>     <dbl>
##  1    725 N804JB       544            545        -1
##  2    727 N588JB      2356           2359        -3
##  3    725 N624JB       539            545        -6
##  4    727 N789JB      2354           2359        -5
##  5    727 N618JB       235           2359       156
##  6    725 N779JB       543            545        -2
##  7    725 N652JB       608            545        23
##  8    727 N599JB      2358           2359        -1
##  9    725 N589JB       602            545        17
## 10    727 N649JB      2357           2359        -2
## # ... with 589 more rows

Example 2: join across tables and retrieve selected columns

dplyr::tbl(con, "##flights") %>%
  dplyr::select(origin, dest, carrier, tailnum ) %>% 
  dplyr::left_join( #dplyr can join database tables 
    dplyr::tbl(con, "##airlines"), by = "carrier"
    ) %>% 
  mutate(airline = name) %>% 
  select(-carrier) %>% # only want the "human readable" airline name
  dplyr::collect()
## # A tibble: 336,776 x 5
##    origin dest  tailnum name                     airline                 
##    <chr>  <chr> <chr>   <chr>                    <chr>                   
##  1 EWR    IAH   N14228  United Air Lines Inc.    United Air Lines Inc.   
##  2 LGA    IAH   N24211  United Air Lines Inc.    United Air Lines Inc.   
##  3 JFK    MIA   N619AA  American Airlines Inc.   American Airlines Inc.  
##  4 JFK    BQN   N804JB  JetBlue Airways          JetBlue Airways         
##  5 LGA    ATL   N668DN  Delta Air Lines Inc.     Delta Air Lines Inc.    
##  6 EWR    ORD   N39463  United Air Lines Inc.    United Air Lines Inc.   
##  7 EWR    FLL   N516JB  JetBlue Airways          JetBlue Airways         
##  8 LGA    IAD   N829AS  ExpressJet Airlines Inc. ExpressJet Airlines Inc.
##  9 JFK    MCO   N593JB  JetBlue Airways          JetBlue Airways         
## 10 LGA    ORD   N3ALAA  American Airlines Inc.   American Airlines Inc.  
## # ... with 336,766 more rows

Example 3: Summarize and count

dplyr::tbl(con, "##flights") %>% 
  filter(!is.na(tailnum)) %>% 
  dplyr::left_join(tbl(con, "##planes"), by = "tailnum") %>% 
  dplyr::group_by(engine) %>% 
  dplyr::summarise(flights = n()) %>%
  dplyr::collect() %>% 
  dplyr::arrange(desc(flights))
## # A tibble: 7 x 2
##   engine        flights
##   <chr>           <int>
## 1 Turbo-fan      240915
## 2 <NA>            50094
## 3 Turbo-jet       40976
## 4 Reciprocating    1774
## 5 Turbo-shaft       410
## 6 4 Cycle            48
## 7 Turbo-prop         47

Quite a few tailnum values in flights, are not present in planes, interesting!

Option 2: Write SQL syntax and have dplyr and dbplyr run the query

When I use this option

I use this option when I am reusing a fairly short, existing SQL query with minor modifications.

Example 1: Simple selection of records using SQL syntax

query <- "SELECT  flight, tailnum, dep_time, origin, dest 
            FROM ##flights"

dplyr::tbl(con, dplyr::sql(query)) %>% 
  dplyr::collect()
## # A tibble: 336,776 x 5
##    flight tailnum dep_time origin dest 
##     <int> <chr>      <int> <chr>  <chr>
##  1   1545 N14228       517 EWR    IAH  
##  2   1714 N24211       533 LGA    IAH  
##  3   1141 N619AA       542 JFK    MIA  
##  4    725 N804JB       544 JFK    BQN  
##  5    461 N668DN       554 LGA    ATL  
##  6   1696 N39463       554 EWR    ORD  
##  7    507 N516JB       555 EWR    FLL  
##  8   5708 N829AS       557 LGA    IAD  
##  9     79 N593JB       557 JFK    MCO  
## 10    301 N3ALAA       558 LGA    ORD  
## # ... with 336,766 more rows

Example 2: Use dplyr syntax to enhance a raw SQL query

dplyr::tbl(con, dplyr::sql(query)) %>% #reuse query from previous example
  # add a filter state on top of the raw sql statement
  dplyr::filter(origin == "JFK", dest == "BQN" ) %>% 
  collect()
## # A tibble: 599 x 5
##    flight tailnum dep_time origin dest 
##     <int> <chr>      <int> <chr>  <chr>
##  1    725 N804JB       544 JFK    BQN  
##  2    727 N588JB      2356 JFK    BQN  
##  3    725 N624JB       539 JFK    BQN  
##  4    727 N789JB      2354 JFK    BQN  
##  5    727 N618JB       235 JFK    BQN  
##  6    725 N779JB       543 JFK    BQN  
##  7    725 N652JB       608 JFK    BQN  
##  8    727 N599JB      2358 JFK    BQN  
##  9    725 N589JB       602 JFK    BQN  
## 10    727 N649JB      2357 JFK    BQN  
## # ... with 589 more rows

Option 3: Store the SQL query in a text file and have dplyr and dbplyr run the query

When I use this option

I use this approach under the following conditions:

  1. I’m reusing existing SQL code or when collaborating with someone who will be writing new code in SQL
  2. The SQL code is longer than a line or two

I prefer to, “modularize,” my R code. Having an extremely long SQL statement in my R code doesn’t abstract away the complexity of the SQL query. Putting the query into it’s own file helps achieve my desired level of abstraction.

In conjunction with source control it makes tracking changes to the definition of a data set simple.

More importantly, it’s a really useful way to collaborate with others who are comfortable with SQL but don’t use R. For example, I recently used this approach on a project involving aggregation of multiple data sets. Another team member focused on building out the data collection logic for some of the data sets in SQL. Once he had them built and validated he handed off the query to me and I pasted it into a text file.

Step 1: Put your SQL code into a text file

Here is some example SQL code that might be in a file

SELECT f.year
        ,f.month
        ,f.day
        ,f.dep_time
        ,a.name
        ,p.manufacturer
        ,p.model
  FROM ##flights f
  LEFT JOIN ##airlines a ON f.carrier = a.carrier
  LEFT JOIN ##planes p ON f.tailnum = p.tailnum
  WHERE p.manufacturer IS NOT NULL 

Let’s say that SQL code was stored in a text file called, flights.sql

Step 2: Use the SQL code in the file to retrieve data and execute the query.

#reference: https://stackoverflow.com/a/46105261/7416441 
query <- readr::read_file("flights.sql") 

dplyr::tbl(con, dplyr::sql(query))
## # A tibble: 284,170 x 7
##     year month   day dep_time name                    manufacturer    model     
##    <int> <int> <int>    <int> <chr>                   <chr>           <chr>     
##  1  2013     1     1      517 United Air Lines Inc.   BOEING          737-824   
##  2  2013     1     1      533 United Air Lines Inc.   BOEING          737-824   
##  3  2013     1     1      542 American Airlines Inc.  BOEING          757-223   
##  4  2013     1     1      544 JetBlue Airways         AIRBUS          A320-232  
##  5  2013     1     1      554 Delta Air Lines Inc.    BOEING          757-232   
##  6  2013     1     1      554 United Air Lines Inc.   BOEING          737-924ER 
##  7  2013     1     1      555 JetBlue Airways         AIRBUS INDUSTR~ A320-232  
##  8  2013     1     1      557 ExpressJet Airlines In~ CANADAIR        CL-600-2B~
##  9  2013     1     1      557 JetBlue Airways         AIRBUS          A320-232  
## 10  2013     1     1      558 JetBlue Airways         AIRBUS          A320-232  
## # ... with 284,160 more rows