Observable Framework uses JavaScript for code running in the browser, but the equally (or arguably more) important data loaders can be written in any language. Don’t believe us? Well, here are examples in Python, R, shell scripts, and Julia. Even if you don’t regularly use JavaScript, you can use your skills in other languages to build fast and beautiful data apps and dashboards.

You can walk through the code here, then clone the examples repo (created by my colleague Allison Horst) and build on the examples yourself. The examples assume that you have the respective runtimes and libraries installed.

How do data loaders work?

When you build a Framework project, data loaders run to fetch data from sources like databases and APIs, or to run models. Their output goes into data files, usually in formats like CSV or parquet, but it can be anything, even images. Those files are then packaged up and deployed to the server. To learn more about data loaders, see our recent blog post.

When your data app loads, these files are loaded with the page, which is much faster than querying a database on the fly; the difference is even bigger when the data loader runs analysis or models, because that work happens once at build time rather than every time someone opens the page. Data loaders are also powerful because you can write them in whatever language you like, which gives you access to existing libraries such as pandas in Python, the vast selection of statistics and machine learning tools in R, and so on. But no matter what you use on the back end, the resulting data app runs natively in your browser and stays fast and responsive.
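To make that concrete, here’s a minimal sketch of the page side. If a data loader is named, say, penguins.csv.py, Framework runs it to generate penguins.csv, and the page picks that file up with FileAttachment just like any static file (the file name here is only an illustration):

// In a Framework page: load the file a data loader generated (file name is illustrative)
const penguins = FileAttachment("penguins.csv").csv({typed: true});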

Using data loaders for logistic regression in Python

First, let’s try a simple logistic regression on a dataset. This data loader, written in Python, loads the Palmer Penguins dataset using pandas and fits a logistic regression that predicts each penguin’s species from its body measurements. It then adds a predicted_species column to the dataset and writes the result out as a CSV file.

# Import libraries (must be installed)
import pandas as pd
from sklearn.linear_model import LogisticRegression
import sys

# Data access, wrangling and analysis
df = pd.read_csv("docs/data-files/penguins.csv")
df_complete = df.dropna(
    subset=["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
)

# Predictors: the four measurement columns (positions 2-5); target: species (position 0)
X = df_complete.iloc[:, [2, 3, 4, 5]]
Y = df_complete.iloc[:, 0]

logreg = LogisticRegression()
logreg.fit(X, Y)

results = df_complete.copy()
results["predicted_species"] = logreg.predict(X)

# Merge the predictions back into the original data frame (rows dropped above get NA)
df_out = df.merge(
    results[["predicted_species"]], how="left", left_index=True, right_index=True
)

# Write the data frame to CSV, and to standard output
df_out.to_csv(sys.stdout)

The original code lives in this repository here.
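As a rough sketch of how that output might be used on a page (the loader’s output file name is an assumption here), you could load the CSV and table the rows where prediction and reality disagree:

// Hypothetical page code; "predictions.csv" stands in for the loader's output name
const penguins = await FileAttachment("predictions.csv").csv({typed: true});

// Show only the penguins the model misclassified
Inputs.table(penguins.filter((d) => d.predicted_species && d.species !== d.predicted_species))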

Fetching data to create a parquet file with DuckDB in a shell script

If you just need to run a command, perhaps to pull a file from a URL or copy it from somewhere else, a simple shell script might be the way to go. Or you can script DuckDB to do the work and run it from a shell script, as we do in this example. We could have used curl here, but DuckDB can load files directly from a URL.

In this case, we’re loading data about fuel stations from an OpenEI.org API. We then keep only the fuel stations in California, pick just the columns we want, and finally export the data as an efficient parquet file. That file can then be loaded into DuckDB again inside the browser, or used directly with Observable Plot to create visualizations.

duckdb -csv :memory: << EOF

CREATE TABLE allp AS (
  FROM 'https://data.openei.org/files/106/alt_fuel_stations%20%28Jul%2029%202021%29.csv'
);

CREATE TABLE cafuelstations AS (
  SELECT "Fuel Type Code" as Type,
  State,
  ZIP,
  Latitude,
  Longitude 
  FROM allp
  WHERE State = 'CA'
);

COPY cafuelstations TO '$TMPDIR/cafuelstations.parquet' (FORMAT 'parquet', COMPRESSION 'GZIP');

EOF

# If stdout is a terminal (interactive use), report where the file is;
# otherwise stream the parquet file to standard output and clean up
if [ -t 1 ]; then
  echo "parquet file output at: $TMPDIR/cafuelstations.parquet"
  echo "duckdb -csv :memory: \"SELECT * FROM '$TMPDIR/cafuelstations.parquet'\""
else
  cat "$TMPDIR/cafuelstations.parquet"
  rm "$TMPDIR/cafuelstations.parquet"
fi

See the original data loader here.
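Back on the page, reading that parquet file doesn’t require a server at all. A minimal sketch, assuming the loader is named cafuelstations.parquet.sh so the page references cafuelstations.parquet:

// Hypothetical page code: read the generated parquet file as an Apache Arrow table
const stations = FileAttachment("cafuelstations.parquet").parquet();

// Or register it with an in-browser DuckDB instance for further SQL
const db = DuckDBClient.of({stations: FileAttachment("cafuelstations.parquet")});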

K-means clustering in R using a data loader

How about some simple clustering using the features built into R? This data loader reads the same penguins dataset as the Python example above and runs k-means clustering on two of the body measurements (scaled first). The result is written out as a new CSV file, to be used on a dashboard or other data app.

# Attach libraries (must be installed)
library(readr)
library(dplyr)
library(tidyr)

# Data access, wrangling and analysis
penguins <- read_csv("docs/data-files/penguins.csv") |>
  drop_na(culmen_depth_mm, culmen_length_mm)

penguin_kmeans <- penguins |>
  select(culmen_depth_mm, culmen_length_mm) |>
  scale() |>
  kmeans(centers = 3)

penguin_clusters <- penguins |>
  mutate(cluster = penguin_kmeans$cluster)

# Convert data frame to delimited string, then write to standard output
cat(format_csv(penguin_clusters))

If you want to try it yourself, you can find the source here.
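And here’s a rough sketch of a page using that output with Observable Plot (the file name clusters.csv is an assumption), coloring the penguins by their assigned cluster:

// Hypothetical page code: plot the k-means clusters from the R data loader
const penguins = await FileAttachment("clusters.csv").csv({typed: true});

Plot.plot({
  color: {type: "categorical", legend: true},
  marks: [
    Plot.dot(penguins, {
      x: "culmen_length_mm",
      y: "culmen_depth_mm",
      fill: "cluster"
    })
  ]
})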

Processing text with Julia

Our final data loader example is written in Julia, and it produces one of the more unusual kinds of data files: plain text. This little example script fetches a book from Project Gutenberg, splits the text into paragraphs, and then pulls out one particular paragraph. You could easily build on this with text mining, sentiment analysis, and more.

#!/usr/bin/env julia

# Load Julia packages (must be installed); only HTTP is used below,
# Gumbo and TextAnalysis come in handy for HTML parsing and text mining
using HTTP
using Gumbo
using TextAnalysis

# Function to fetch text
function fetch_text_from_url(url::String)
  response = HTTP.get(url)
  text = String(response.body)
  text = replace(text, "\r" => "")
  return text
end

# Split into paragraphs
function split_into_paragraphs(text::String)
  paragraphs = split(text, "\n\n")
  return paragraphs
end

# Return a paragraph by number
function get_paragraph_by_number(text::String, paragraph_number::Int)
  paragraphs = split_into_paragraphs(text)
  return paragraphs[paragraph_number]
end

# Text URL
url = "https://www.gutenberg.org/cache/epub/1065/pg1065.txt"

# Fetch text and access a paragraph by number
text = fetch_text_from_url(url)
paragraph_number = 29
result_paragraph = get_paragraph_by_number(text, paragraph_number)

# Print text to standard output
println(result_paragraph)

The code is available here for your perusal.
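The page side is just as small; a minimal sketch, assuming the loader is named paragraph.txt.jl so it generates paragraph.txt:

// Hypothetical page code: load the paragraph emitted by the Julia data loader
const paragraph = FileAttachment("paragraph.txt").text();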

Literally any other language

The examples above show just a few specific possibilities. Observable Framework recognizes a number of file extensions by default, but you can run code in any language. One way to do this is to use a shell script as a bridge, as you saw with the DuckDB example above. If you want to make things more official, you can register your own extensions and associated interpreters, as sketched below. As of this writing, the list of languages supported by default includes JavaScript, TypeScript, Python, R, Rust, Go, Java, Julia, PHP, shell scripts, and even binaries.
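For instance, here’s a sketch of registering Perl as an interpreter in the project config (observablehq.config.js). The interpreters option maps a file extension to the command used to run it; the extension and command below are just an illustration, so check the Framework documentation for the exact details:

// observablehq.config.js (sketch): a loader named data.csv.pl would be run as `perl data.csv.pl`
export default {
  interpreters: {
    ".pl": ["perl"]
  }
};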

Go ahead and try it yourself in a Framework project; it’s easy to get started! Whatever process or pipeline you want to use to access, process, or model your data, you can use it now to power fast, interactive data apps that don’t make your users wait.