The Research Process – Roots of Education

Selecting The Sources

Our primary data source is the Stanford Education Data Archive (SEDA) dataset Version 5.0, specifically the dataset titled seda_geodist_pool_cs_5.0_updated_20240319.csv.

This dataset compiles standardized test results as student academic achievement levels aggregated by geographic school district across the U.S., with subgroup breakdowns by economic status, race, gender, and subject. Achievement levels within each geographic school district are pooled over Grades 3-8, over the academic years 2008-09 through 2018-19, and over the subjects of Math and RLA (Reading Language Arts).

These raw achievement levels are then converted to a standardized, nationally comparable scale (the cohort standardized scale) to account for differences in testing conditions and procedures across states. The resulting values measure how students in a geographic school district perform relative to the national average, expressed as the number of standard deviations above or below it.
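To make the "standard deviations above or below" unit concrete, here is a minimal sketch of expressing a score relative to a reference distribution. It is only an illustration of how to read the values: it is not SEDA's actual cohort standardization procedure, and the function name `to_sd_units` and all numbers are made up for this example.

```python
from statistics import mean, pstdev


def to_sd_units(score, reference_scores):
    """Express `score` as standard deviations above/below the reference mean.

    Hypothetical helper for illustration only; SEDA's real scaling is
    more involved than a plain z-score.
    """
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("reference scores have no spread")
    return (score - mu) / sigma


# Made-up "national" reference scores and a made-up district average.
national_scores = [200, 220, 240, 260, 280]
district_mean = 260

z = to_sd_units(district_mean, national_scores)
print(f"District is {z:+.2f} SD relative to the reference mean")
```

A positive value means the district scores above the reference mean, a negative value below it, and zero means it matches the mean.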

Alongside the primary source, the SEDA pooled dataset, we collected scholarly, peer-reviewed secondary sources to support our research questions. We searched Google Scholar and the UCLA Library databases using keywords such as “education achievement gap by race”, “stanford education data archive”, and “education achievement gap socioeconomic”, and from the results selected sources on racial, socioeconomic, and gendered achievement disparities (gaps).

We used this dataset because:

It is large, reliable, and already cleaned and organized by researchers.
It includes key indicators for racial, gender, and economic achievement gaps.
It allows us to explore our key questions about racial achievement gaps in the U.S.
It allows for cross-state comparisons across different state assessments by converting achievement levels into a common national cohort scale.

Processing The Data

We began processing our data with Breve and OpenRefine, examining it visually for ‘dirty data’ such as blank rows or columns, cells with similar names that should be consolidated, and incomplete values. Breve revealed no major discrepancies, which we largely expected given that the dataset comes from a large-scale research project with academic editors. Further examination in OpenRefine, using the “Facet” feature alongside “Cluster & Edit”, showed that certain rows had empty values in the cohort slope, grade slope, and subject-based achievement gap columns (quantitative variables). We kept this in mind when constructing our visualizations, especially when working with averages or sums of these academic achievement variables.
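The same kind of empty-value check can also be sketched in code. The example below is an assumption-laden illustration, not the workflow we actually used (that was Breve and OpenRefine): the column names and sample rows are placeholders rather than confirmed SEDA column names, and the helper functions are hypothetical.

```python
import csv
import io

# Made-up sample rows; column names are placeholders, not real SEDA names.
SAMPLE = """district,avg_achievement,cohort_slope,grade_slope
A,0.25,0.01,
B,-0.10,,0.02
C,0.40,0.03,0.01
"""


def count_missing(rows, columns):
    """Count blank cells per column."""
    return {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }


def mean_ignoring_missing(rows, column):
    """Average a numeric column, skipping blank cells instead of treating them as 0."""
    values = [float(row[column]) for row in rows if (row[column] or "").strip()]
    return sum(values) / len(values) if values else None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows, ["avg_achievement", "cohort_slope", "grade_slope"])
cohort_mean = mean_ignoring_missing(rows, "cohort_slope")

print(missing)
print(cohort_mean)
```

Skipping blank cells, rather than counting them as zero, keeps averages of the slope and gap variables from being pulled toward zero by districts that simply have no estimate.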

We used Tableau to create all our data visualizations: four bar charts, a scatterplot, and a map.

We also built a timeline using TimelineJS that shows key historical events and policies related to racial inequality in education. This helped connect patterns in the dataset to real-world social and political contexts.

“Visualization is very much based on the data behind it, but you still have to make choices based on factors outside the actual data file.”
- Yau

Following Yau (2021), we thought carefully about who would see our charts and what we wanted to show when creating them. We made decisions such as which colors to use, which chart type to choose, and which groups to include, so that the message would be clear and easy to understand.

Presenting The Narrative

“You are not just transmitting data; you are shaping interpretation.”
- Brandon Walsh and Sarah Horowitz

Walsh and Horowitz (2021) make this point in “Introduction to Text Analysis”. It reminded us that how we design and share our charts can influence how people understand racial achievement gaps.

For instance, when comparing average test scores and achievement gaps by subgroup, we used grouped bar charts to show clearly how socioeconomic and racial disparities differ. Each chart followed a consistent visual language with clear labels, intuitive legends, and a thoughtful layout.

We used Tableau to create bar charts and maps that highlight the patterns in our data. Each visualization was designed with our audience in mind: people who may not be familiar with education data but want to understand the broader patterns. We made choices about color schemes, chart types, and which variables to include to make sure those patterns stood out clearly.
