AI for Good Workshop 2025¶
Join us for the AI for Good Workshop 2025, part of the UN's AI for Good workshop series! This workshop will take place online on February 18, 2025, from 9:00 AM to 10:30 AM EST. It is free and open to the public. Please register using this link: Modeling population dynamics with AI: A hands-on workshop with the Population Dynamics Foundation Model.
Overview¶
Explore the transformative potential of the Population Dynamics Foundation Model (PDFM), a cutting-edge AI model designed to capture complex, multidimensional interactions among human behaviors, environmental factors, and local contexts. This workshop provides an in-depth introduction to PDFM Embeddings and their applications in geospatial analysis, public health, and socioeconomic modeling.
Participants will gain hands-on experience with PDFM Embeddings to perform advanced geospatial predictions and analyses while ensuring privacy through the use of aggregated data. Key components of the workshop include:
- Introduction to PDFM Embeddings: Delve into the model architecture of PDFM and discover how aggregated data (such as search trends, busyness levels, and weather conditions) generates location-specific embeddings.
- Data Preparation: Learn to integrate ground truth data, including health statistics and socioeconomic indicators, with PDFM Embeddings at the postal code or county level.
- Hands-On Exercises: Engage with interactive Colab notebooks to explore real-world applications, such as predicting housing prices using Zillow data and nighttime light predictions with Google Earth Engine data.
- Visualization and Interpretation: Analyze and visualize geospatial predictions and PDFM features in 3D, enhancing your ability to interpret complex datasets.
By the end of this workshop, participants will have a strong foundation in utilizing PDFM Embeddings to address real-world geospatial challenges.
Target audience¶
This workshop is designed for data scientists, geospatial analysts, researchers, urban planners, and professionals in public health, economics, or environmental science who want to integrate AI into their workflows.
Prerequisites¶
- A Google Colab account
- Access to the PDFM embeddings
- Basic understanding of Python programming and geospatial data concepts is recommended
Recording¶
The recording of the workshop will be made available on YouTube after the event. Stay tuned for the link!
Environment setup¶
Install the required packages locally¶
If you are running this notebook locally, you can install the required packages using the following commands:
conda create -n geo python=3.12
conda activate geo
conda install -c conda-forge mamba
mamba install -c conda-forge leafmap maplibre scikit-learn
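After installation, you can verify the environment by importing the package (a quick sanity check):
import leafmap
print(leafmap.__version__)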
Use Google Colab¶
If you are using Google Colab, run the following cell to install the required packages:
%pip install "leafmap[maplibre]" scikit-learn
Predicting US Housing Prices Using PDFM and Zillow Data¶
To follow along with the workshop, you will need to have access to the PDFM embeddings. Please request access to the PDFM embeddings here. Download the embeddings and upload them to your Google Drive or Google Colab environment.
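If you uploaded the embeddings to Google Drive, a minimal sketch for making them available in Colab looks like the following (the Drive path is an assumption; adjust it to wherever you stored the file):
from google.colab import drive
# Mount Google Drive and copy the embeddings into the working directory
drive.mount("/content/drive")
!cp /content/drive/MyDrive/zcta_embeddings.csv .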
This notebook is adapted from the PDFM tutorial. Credit goes to the authors of the PDFM tutorial.
Import Libraries¶
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from leafmap.common import evaluate_model, plot_actual_vs_predicted, download_file
Download Zillow Data¶
The Zillow housing data can be downloaded from the Zillow Research Data page. We will use the Zillow Home Value Index (ZHVI) data for single-family homes at the ZIP code level.
zhvi_url = "https://github.com/opengeos/datasets/releases/download/us/zillow_home_value_index_by_zipcode.csv"
zhvi_file = "zillow_home_value_index_by_zipcode.csv"
if not os.path.exists(zhvi_file):
    download_file(zhvi_url, zhvi_file)
Process Zillow Data¶
The Zillow ZHVI dataset contains a RegionName column that holds the ZIP code. We need to format the ZIP code to match the PDFM embeddings' place index, which looks like zip/XXXXX.
zhvi_df = pd.read_csv(zhvi_file, dtype={"RegionName": "string"})
zhvi_df.index = zhvi_df["RegionName"].apply(lambda x: f"zip/{x}")
zhvi_df.head()
embeddings_file_path = "zcta_embeddings.csv"
if not os.path.exists(embeddings_file_path):
    raise FileNotFoundError("Please request the embeddings from Google")
Load PDFM Embeddings¶
We will load the PDFM embeddings from the location where you saved them in your Google Drive or Colab environment.
zipcode_embeddings = pd.read_csv(embeddings_file_path).set_index("place")
zipcode_embeddings.head()
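The embeddings table should have one row per ZIP code and 330 feature columns (feature0 through feature329); a quick check along those lines:
# Sanity check: expect 330 columns named feature0 ... feature329
feature_columns = [c for c in zipcode_embeddings.columns if c.startswith("feature")]
print(len(feature_columns), zipcode_embeddings.shape)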
Join Zillow and PDFM Data¶
We will join the Zillow and PDFM data on their shared place index.
data = zhvi_df.join(zipcode_embeddings, how="inner")
data.head()
embedding_features = [f"feature{x}" for x in range(330)]
label = "2025-01-31"
data = data.dropna(subset=[label])
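The label is one of the monthly date columns in the ZHVI table. To predict a different month, you can first list the available dates (a quick sketch that assumes the date columns start with the century digits):
# List candidate label columns, i.e., columns that look like dates
date_columns = [c for c in zhvi_df.columns if c.startswith("20")]
print(date_columns[-5:])  # the five most recent months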
Split Train and Test Data¶
We will split the data into training and testing sets using an 80-20 split, then use the training set to train a machine learning model to predict housing prices.
data = data[embedding_features + [label]]
X = data[embedding_features]
y = data[label]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.head()
y_train.head()
Fit Linear Regression Model¶
We will fit a linear regression model to predict the Zillow Home Value Index (ZHVI) using the PDFM embeddings.
# Initialize and train a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
Evaluate Linear Regression Model¶
evaluation_df = pd.DataFrame({"y": y_test, "y_pred": y_pred})
metrics = evaluate_model(evaluation_df)
print(metrics)
xy_lim = (0, 3_000_000)
plot_actual_vs_predicted(
    evaluation_df,
    xlim=xy_lim,
    ylim=xy_lim,
    title="Actual vs Predicted Home Values",
    x_label="Actual Home Value",
    y_label="Predicted Home Value",
)
Fit K-Nearest Neighbors Model¶
We will fit a K-Nearest Neighbors (KNN) model to predict the Zillow Home Value Index (ZHVI) using the PDFM embeddings.
k = 5
model = KNeighborsRegressor(n_neighbors=k)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
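The choice of k = 5 is a reasonable starting point. If you want to compare other values, a small cross-validated grid search is one option (a sketch; the grid values below are arbitrary):
from sklearn.model_selection import GridSearchCV
# Compare a few neighbor counts with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 5, 10, 20]}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)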
Evaluate K-Nearest Neighbors Model¶
evaluation_df = pd.DataFrame({"y": y_test, "y_pred": y_pred})
metrics = evaluate_model(evaluation_df)
print(metrics)
plot_actual_vs_predicted(
    evaluation_df,
    xlim=xy_lim,
    ylim=xy_lim,
    title="Actual vs Predicted Home Values",
    x_label="Actual Home Value",
    y_label="Predicted Home Value",
)
Mapping PDFM Features and Predicted Housing Prices¶
Import Libraries¶
import os
import pandas as pd
import geopandas as gpd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from leafmap.common import evaluate_model, plot_actual_vs_predicted, download_file
import leafmap.maplibregl as leafmap
Download Zillow Data¶
Download the Zillow home value data at the county level.
zhvi_url = "https://github.com/opengeos/datasets/releases/download/us/zillow_home_value_index_by_county.csv"
zhvi_file = "zillow_home_value_index_by_county.csv"
if not os.path.exists(zhvi_file):
    download_file(zhvi_url, zhvi_file)
Process Zillow Data¶
The county-level Zillow ZHVI dataset contains StateCodeFIPS and MunicipalCodeFIPS columns that hold the state and county FIPS codes. We need to concatenate the FIPS codes to match the PDFM embeddings' place index, which looks like geoId/XXYYY (state FIPS followed by county FIPS). For example, Los Angeles County (state FIPS 06, county FIPS 037) becomes geoId/06037.
zhvi_df = pd.read_csv(
    zhvi_file, dtype={"StateCodeFIPS": "string", "MunicipalCodeFIPS": "string"}
)
zhvi_df.index = "geoId/" + zhvi_df["StateCodeFIPS"] + zhvi_df["MunicipalCodeFIPS"]
zhvi_df.head()
Request access to PDFM Embeddings¶
The PDFM embeddings zip file you downloaded earlier includes a county.geojson file with US county boundaries.
county_geojson = "county.geojson"
if not os.path.exists(county_geojson):
    raise FileNotFoundError("Please request the embeddings from Google")
Load county boundaries¶
county_gdf = gpd.read_file(county_geojson)
county_gdf.set_index("place", inplace=True)
county_gdf.head()
Join home value data and county boundaries¶
df = zhvi_df.join(county_gdf)
zhvi_gdf = gpd.GeoDataFrame(df, geometry="geometry")
zhvi_gdf.head()
column = "2025-01-31"
gdf = zhvi_gdf[["RegionName", "State", column, "geometry"]]
gdf.head()
Visualize home values in 2D¶
m = leafmap.Map(style="liberty")
first_symbol_id = m.find_first_symbol_layer()["id"]
m.add_data(
    gdf,
    cmap="Blues",
    column=column,
    legend_title="Median Home Value",
    name="Median Home Value",
    before_id=first_symbol_id,
)
m.add_layer_control()
m
Visualize home values in 3D¶
m = leafmap.Map(style="liberty", pitch=60)
m.add_data(
    gdf,
    cmap="Blues",
    column=column,
    legend_title="Median Home Value",
    extrude=True,
    scale_factor=3,
    before_id=first_symbol_id,
    name="Median Home Value",
)
m.add_layer_control()
m
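To share an interactive map outside the notebook, you can export it to a standalone HTML file (a sketch that assumes your leafmap version provides to_html; the filename is arbitrary):
# Export the current map to a standalone HTML file
m.to_html("home_values_3d.html")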
Load PDFM county embeddings¶
embeddings_file_path = "county_embeddings.csv"
embeddings_df = pd.read_csv(embeddings_file_path).set_index("place")
embeddings_df.head()
df = embeddings_df.join(county_gdf)
embeddings_gdf = gpd.GeoDataFrame(df, geometry="geometry")
embeddings_gdf.head()
Visualize PDFM features¶
Select any of the 330 PDFM features to visualize.
column = "feature329" # Change this to the feature you want to use
gdf = embeddings_gdf[[column, "state", "county", "geometry"]]
gdf.head()
m = leafmap.Map(style="liberty")
m.add_data(
    gdf,
    cmap="Blues",
    column=column,
    legend_title=column,
    before_id=first_symbol_id,
    name=column,
)
m.add_layer_control()
m
m = leafmap.Map(style="liberty", pitch=60)
m.add_data(
    gdf,
    cmap="Blues",
    column=column,
    legend_title=column,
    before_id=first_symbol_id,
    name=column,
    extrude=True,
    scale_factor=0.00005,
)
m.add_layer_control()
m
Join Zillow and PDFM Data¶
data = zhvi_df.join(embeddings_df, how="inner")
data.head()
embedding_features = [f"feature{x}" for x in range(330)]
label = "2025-01-31" # Change this to the date you want to predict
data = data.dropna(subset=[label])
Split Train and Test Data¶
data = data[embedding_features + [label]]
X = data[embedding_features]
y = data[label]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Fit Linear Regression Model¶
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Evaluate Linear Regression Model¶
evaluation_df = pd.DataFrame({"y": y_test, "y_pred": y_pred})
metrics = evaluate_model(evaluation_df)
print(metrics)
xy_lim = (0, 1_000_000)
plot_actual_vs_predicted(
    evaluation_df,
    xlim=xy_lim,
    ylim=xy_lim,
    title="Actual vs Predicted Home Values",
    x_label="Actual Home Value",
    y_label="Predicted Home Value",
)
Join predicted values with county boundaries¶
df = evaluation_df.join(gdf)
df["difference"] = df["y_pred"] - df["y"]
evaluation_gdf = gpd.GeoDataFrame(df, geometry="geometry")
evaluation_gdf.drop(columns=["category", "color", column], inplace=True)
evaluation_gdf.head()
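Before mapping the differences, it can help to summarize the error distribution numerically (positive values indicate over-prediction):
# Summary statistics of the prediction error (y_pred - y)
print(evaluation_gdf["difference"].describe())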
Visualize actual home values¶
m = leafmap.Map(style="liberty", pitch=60)
m.add_data(
    evaluation_gdf,
    cmap="Blues",
    column="y",
    legend_title="Actual Home Value",
    before_id=first_symbol_id,
    name="Actual Home Value",
    extrude=True,
    scale_factor=3,
)
m.add_layer_control()
m
Visualize predicted home values¶
m = leafmap.Map(style="liberty", pitch=60)
m.add_data(
    evaluation_gdf,
    cmap="Blues",
    column="y_pred",
    legend_title="Predicted Home Value",
    before_id=first_symbol_id,
    name="Predicted Home Value",
    extrude=True,
    scale_factor=3,
)
m.add_layer_control()
m
Visualize difference between predicted and actual home values¶
m = leafmap.Map(style="liberty", pitch=60)
m.add_data(
    evaluation_gdf,
    cmap="coolwarm",
    column="difference",
    legend_title="y_pred-y",
    before_id=first_symbol_id,
    name="Difference",
    extrude=True,
    scale_factor=3,
)
m.add_layer_control()
m