Meryl Marie

The Unseen Pandemic: COVID-19 in US Prisons


Original Art by Meryl Duff

The COVID-19 pandemic has changed the world forever. On a scale never seen before, humanity has experienced a collective trauma. Thanks to technology, we have unprecedented access to data on spread, mortality rates, vaccines, and more. Also thanks to technology, we have seen the faces of the pandemic from every corner of the globe. There is a population in the United States, however, that has struggled to get its experiences with this devastating pandemic out to the public.


As of December 2020, the rate of COVID-19 among prisoners in the United States was four times as high as in the general population. Even as the first Americans began getting their vaccines, the spread in prisons showed no signs of slowing. According to Homer Venters, the former chief medical officer at New York’s Rikers Island jail complex, those numbers are an undercount. Because of crowded conditions, prisoners cannot socially distance, and they do not receive proper care from the medical staff inside. This is why collecting and sharing these numbers is so important.


The data for this project is collected by The Associated Press and The Marshall Project, a nonprofit investigative newsroom dedicated to the U.S. criminal justice system. It is collected weekly by Marshall Project and AP reporters who call the facilities to get "the cumulative number who tested positive among staff and prisoners, and the numbers of deaths for each group."


Data Cleaning & EDA


The data is relatively clean, and The Marshall Project has done an amazing job of aggregating it for reporting. The reporting dates were not evenly spaced, which caused frequency problems for the time series, but I was able to fix that by resampling the data (essentially a pandas groupby for time data) to the weekly level.
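
For readers who have not used resampling before, here is a minimal sketch of the idea. It is not my project code: the column names (as_of_date, total_prisoner_cases) and the numbers are made up for illustration, and taking the last value in each week is just one reasonable choice for cumulative counts.

```python
import pandas as pd

# Hypothetical sample: cumulative case counts reported on irregular dates.
raw = pd.DataFrame(
    {
        "as_of_date": pd.to_datetime(
            ["2020-04-01", "2020-04-03", "2020-04-09", "2020-04-22"]
        ),
        "total_prisoner_cases": [10, 14, 25, 60],
    }
).set_index("as_of_date")

# Resample to a fixed weekly frequency (weeks ending on Sunday),
# keeping the last cumulative value reported in each week.
weekly = raw["total_prisoner_cases"].resample("W").last()

# Weeks with no report come back as NaN; carry the previous total forward.
weekly = weekly.ffill()

print(weekly)
```

The result has exactly one row per week, which is what most time series models expect.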


Another difficult aspect of the data is that cases are reported as a cumulative total. If a state does not report for multiple weeks, the total stays flat (rather than showing 0 new cases for those weeks) and then jumps drastically when the number is eventually reported.
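
To make that pattern concrete, here is a small sketch with invented numbers (not the real data). Differencing the cumulative total gives weekly new cases, and the flat stretch followed by a jump becomes easy to spot.

```python
import pandas as pd

# Hypothetical cumulative totals: the count stalls for two weeks, then jumps.
cumulative = pd.Series(
    [100, 140, 140, 140, 260],
    index=pd.date_range("2020-05-03", periods=5, freq="W"),
)

# Week-over-week change in the cumulative total = new cases per week.
# The first week has no previous value, so it is dropped.
new_cases = cumulative.diff().dropna()

# Weeks where the total did not move at all.
flat_weeks = new_cases[new_cases == 0]

print(new_cases)
print(f"{len(flat_weeks)} week(s) with no change in the cumulative total")
```

A zero here could mean a week with truly no new cases or a week with no report. The numbers alone cannot tell the two apart, which is part of what makes this data tricky to model.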


The goal of this project was to produce a time series model to forecast COVID-19 cases in New Jersey prisons. Here is how I went about that.


Modeling


Below is a simple plot of my linear time series regression with 2 lags. The blue line is the training data, the green line is the predicted values, and the yellow line is the actual values. You can see that the model is attempting to account for the large spike in growth directly before the prediction period. It then goes on to correct itself, with an RMSE of 98 and an R² score of 0.70, making it my most successful model.
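
If "regression with 2 lags" is unfamiliar, the sketch below shows the general idea: each week's value is predicted from the two weeks before it. This is an illustration under my own assumptions, not the exact code behind the plot. The numbers are invented, and I use NumPy's least squares solver here only to keep the example dependency-free.

```python
import numpy as np

# Hypothetical weekly case counts (not the New Jersey data).
y = np.array([5, 8, 12, 20, 31, 45, 52, 60, 71, 85, 90, 104], dtype=float)

# Design matrix with an intercept, the value one week back (lag 1),
# and the value two weeks back (lag 2).
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
target = y[2:]

# Chronological split: never shuffle time series data.
split = 7
X_train, X_test = X[:split], X[split:]
y_train, y_test = target[:split], target[split:]

# Ordinary least squares fit on the training weeks only.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predict the held-out weeks from their observed lags.
pred = X_test @ coef

rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print("coefficients (intercept, lag 1, lag 2):", coef)
print("test RMSE:", rmse)
```

Note that the test predictions here use the actual lagged values, so this is a one-step-ahead forecast. Forecasting several weeks into the future would mean feeding predictions back in as lags.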



My second model, the AutoRegressive Integrated Moving Average (ARIMA), was much more involved with very little return. After testing the undifferenced data with the Augmented Dickey-Fuller test, I determined that the data was already stationary, so I did not have to difference it in the model. I struggled to make the model predict due to some frequency errors (covered in this blog post), but eventually got it to run. With an RMSE of 31, I was excited to have a better performing model! Then I calculated the R² score and it turned out to be 0, indicating that the model explains none of the variance in the data.
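
A low RMSE next to an R² of 0 can look contradictory, so here is a small sketch of how the two metrics relate. The numbers are invented, and I am not claiming this is exactly what my ARIMA did. It only shows that a forecast which always predicts the average of the test values scores an R² of 0, whatever its RMSE happens to be.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the data."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """1 minus (model's squared error / squared error of predicting the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Hypothetical actual values for a short test window.
actual = np.array([100.0, 120.0, 150.0, 130.0])

# A "flat" forecast that predicts the mean of the test window every week.
flat = np.full_like(actual, actual.mean())

print("RMSE:", rmse(actual, flat))
print("R2:  ", r2_score(actual, flat))
```

RMSE depends on the scale of the values in the test window, while R² compares the model against the mean of that same window. That is why I now look at both before calling one model better than another.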


Results & Predictions


I ended up using my first model, which was meant to be the baseline model. I successfully predicted on an unseen date, so we will find out if it was correct!


This project taught me so much about the data science process. I struggled to find a topic and data to use for it. I wanted to balance doing something I was passionate about (issues surrounding mass incarceration and social justice) with something that hadn't been done before, and something that felt useful. I also decided to learn Tableau for this presentation, and created some really fun data visualizations with this project.


Data Visualization


All data is from the second data pull in April 2021.









