Blog | duffparty

Life After a Data Science Bootcamp

Meryl Marie
Sep 10, 2021
2 min read

The last time I updated my LinkedIn, I told people I was working on a personal project and to keep an eye out! Well, of course life happens and I was busy with interviews and vacation and living, so I haven't been able to attack it the way I wanted.

It was a good lesson in remembering that side projects, hustles, and productivity do not define my worth. I wanted to start data science because of a few reasons, including money and jobs and professional growth, but also because I truly enjoy statistics. I remembered my favorite classes in college and how they centered around social statistics and predictions.

I will be working on the project because I want to study conspiracy theories, help people combat misinformation, and identify white supremacist dog whistles - not because I want to impress my LinkedIn contacts. I feel like there are many seemingly innocuous conspiracies that on the surface are fun, silly, slightly believable... and then as you delve further into them, they release the stench of white supremacy.

Abbie Richards has a helpful and elegant explanation of conspiracies: an inverted pyramid (lol) that starts narrow and "grounded in reality" and grows into a large, detached-from-reality, anti-Semitic group of common right-wing conspiracies.

I find this fascinating, and want to spend some time researching it using my data science skills. I will uncover the dog whistles that create these white supremacist lies, and maybe someday create a bot, or dictionary, or app that identifies the language and alerts people to it. I would hope that this can slow the malignant spread of misinformation and prevent the brainwashing that is now putting people's lives at risk. But I also need to find a full-time job.

I spent time this summer looking for a job, prepping for interviews, and stressing about those interviews. I aced some interviews, messed up a lot, forgot how to code, remembered how to code, cried, drank wine, worked out, slept in, went paddle boarding, went boating, walked my dog, and so much more. I got married, I saw my family, and my best friend had a baby. I lived so much of my life in the last few months, and remembered how nice it is to have time to enjoy these things.

The job search after an intense coding bootcamp can challenge your coping skills and shake your confidence. But it can also open your eyes up to a life that doesn't revolve around work. I realized that I don't have a dream job, I have a dream life, and my next job will help me get there. That is what I hope to take with me when I start my new role in a few weeks. Until then, I will be sleeping in, walking my dog, and pondering (without a timeline) my next project.

Attention Sociology Students: Try Data Science!

Meryl Marie
Apr 19, 2021
2 min read

When I was in college, I took a Statistics 101 class for my Sociology/Anthropology major. We calculated chi-square tests, correlation, and statistical significance to find relationships between variables. We investigated questions inspired by the General Social Survey (GSS) data, which has information on family life, careers, demographics, and more. At the time, it felt overwhelming and confusing. I couldn't remember the difference between the independent and dependent variables and found SPSS hard to work with. But I wrote a paper on how Americans perceive women in politics, and truly enjoyed the difficult, confusing, and exciting process of integrating the quantitative data analysis with academic & qualitative research.

After that experience, though, I soon forgot my stats knowledge. When I entered the data analysis field early in my career, I tried to remember how to apply my basic statistics to a business problem. The idea excited me - but I couldn't quite figure it out.

Fast forward to 2020 and I feel like I finally have the answer on how to use my sociology degree in the "real world", and that is through data science!

Sociologists, and social scientists in general, are no strangers to investigating large problems and looking for solutions. The research process for sociology is closely related to the data science process. People often think that machine learning practitioners and data scientists only deal with clean, large data sets. Unfortunately, when you're looking to solve a problem with data, whether it is a business problem or a social problem, you usually have to put the data in the correct format first. Models can be finicky, requiring specific parameters and input formats to run. Sometimes you think you have the right information, but in reality one misspelled variable throws the model off and it won't predict anything. That is why a data scientist has to work hard to get the data into the correct format by cleaning, munging, and/or wrangling it before modeling. Similarly, quantitative sociologists must take care in setting up their data collection so that it can be properly investigated using statistical methods.
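As a small illustration of the cleaning step, here is a minimal sketch in Python. The survey labels, the misspellings, and the `corrections` mapping are all made up for the example; real data would need its own list of fixes.

```python
from collections import Counter

# Hypothetical survey responses; the labels and misspellings are invented.
raw_responses = ["Agree", "agree ", "Disagree", "Agre", "DISAGREE", "agree"]

# Assumed mapping from known misspellings to the intended label.
corrections = {"agre": "agree"}


def clean_label(value: str) -> str:
    """Trim whitespace, lowercase, and apply known spelling corrections."""
    normalized = value.strip().lower()
    return corrections.get(normalized, normalized)


cleaned = [clean_label(response) for response in raw_responses]

# Before cleaning, every response looks like its own category.
print(Counter(raw_responses))
# After cleaning, only the two intended categories remain.
print(Counter(cleaned))
```

Without this step, a model would treat "Agree", "agree " and "Agre" as three unrelated categories, which is the kind of quiet error that one misspelled value can cause.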

My favorite model that we learned in the General Assembly Data Science Immersive program is logistic regression. Similar to a linear regression, which predicts a numeric value based on a number of features, the logistic regression predicts a class or category. Not only can it predict a class such as race or gender (very exciting for a social scientist), it can tell the data scientist which features in the dataset helped it get there. These models are called "white box" models, meaning we can make inferences based on the coefficients that the model produces.
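To make the "white box" idea concrete, here is a minimal sketch of how a fitted logistic regression turns features into a class, and how its coefficients can be read. The intercept, the coefficients, and the feature names are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical fitted values on the log-odds scale (invented numbers).
intercept = -1.0
coefficients = {"years_of_education": 0.25, "lives_in_city": 0.5}


def predict_probability(features: dict) -> float:
    """Combine the features linearly, then squash the result with the sigmoid."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_class(features: dict, threshold: float = 0.5) -> int:
    """Predict class 1 when the probability reaches the threshold, else class 0."""
    return int(predict_probability(features) >= threshold)


# exp(coefficient) is the odds ratio: the multiplicative change in the odds
# of class 1 for a one-unit increase in that feature, holding the rest fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

example = {"years_of_education": 4, "lives_in_city": 0}
print(predict_probability(example), predict_class(example))
print(odds_ratios)
```

Reading the odds ratios is the inference step: a value above 1 means the feature pushes predictions toward class 1, and a value below 1 pushes them away from it.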

If you are like me, and get excited when you solve a puzzle, enjoy reading and learning about data, and miss the thrill of academic research but don't want to BE in academia, I really can't recommend a data science program enough. And while I don't have a career in the industry yet, I am excited for all the possibilities ahead!

Feel free to reach out if you have questions about my experience with data science and sociology:

Instagram

Twitter

The Unseen Pandemic: COVID-19 in US Prisons

Meryl Marie
Apr 19, 2021
3 min read

The COVID-19 pandemic has changed the world forever. On a scale never seen before, humanity has experienced a collective trauma. Thanks to technology we have unprecedented access to data around spread, mortality rates, vaccines, and more. Also thanks to technology, we have seen the faces of the pandemic from every corner of the globe. There is a population in the United States, however, that has struggled to get their experiences with this devastating pandemic out to the public.

As of December 2020, the rate of COVID-19 among prisoners in the United States was 4 times as high as in the general population. Even as the first Americans began getting their vaccines, the spread in prisons showed no signs of slowing. According to Homer Venters, the former chief medical officer at New York’s Rikers Island jail complex, those numbers are an undercount. Because of crowded conditions, prisoners cannot social distance, and they don't receive proper care from the medical staff inside. This is why collecting and sharing this data is so important.

The data for this project is collected by The Associated Press and The Marshall Project, a nonprofit investigative newsroom dedicated to the U.S. criminal justice system. It is collected weekly by Marshall Project and AP reporters who call the facilities to get "the cumulative number who tested positive among staff and prisoners, and the numbers of deaths for each group."

Data Cleaning & EDA

The data is relatively clean, and The Marshall Project has done an amazing job of aggregating it for reporting. There were some inconsistencies in the frequency of the time series, but I was able to fix that by resampling the data (basically a pandas groupby for time data) at the weekly level.
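A minimal sketch of weekly resampling with pandas follows. The dates and counts are made up, and the choice to keep the last value of each week and forward-fill empty weeks is an assumption that suits cumulative totals, not necessarily what the original notebook did.

```python
import pandas as pd

# Made-up cumulative case counts with irregular reporting dates (not the real data).
reports = pd.DataFrame(
    {"cases": [10, 12, 20, 35]},
    index=pd.to_datetime(["2020-04-01", "2020-04-03", "2020-04-09", "2020-04-22"]),
)

# "W" groups rows into weeks ending on Sunday. For a cumulative total, the
# last value reported in each week is the natural summary. A week with no
# report comes back as NaN, so carry the previous total forward.
weekly = reports.resample("W").last().ffill()

print(weekly)
```

The result has exactly one row per week, which is the regular frequency that time series models generally expect.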

Another difficult aspect of the data is how the cumulative total is reported. If a state does not report for multiple weeks, the data stays at the same total (rather than showing 0 for those weeks), and then jumps drastically when the number is eventually reported.
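The flat-then-jump pattern is easy to see once the cumulative totals are turned into weekly new cases. This is a minimal sketch with invented numbers; flagging zero-change weeks as "possibly unreported" is a heuristic assumption, since a true zero looks identical in the data.

```python
# Made-up weekly cumulative totals; the value 130 repeats while no report arrives.
cumulative = [100, 130, 130, 130, 220]

# New cases per week are the differences between consecutive totals.
new_cases = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# Positions in `cumulative` where the total did not move. These may be
# unreported weeks rather than weeks with no infections.
stale_weeks = [i + 1 for i, change in enumerate(new_cases) if change == 0]

print(new_cases)
print(stale_weeks)
```

The final difference absorbs all the cases from the silent weeks, which is the spike a forecasting model then has to cope with.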

The goal of this project was to produce a time series model to forecast COVID-19 cases in New Jersey prisons. Here is how I went about that.

Modeling

Below is a simple plot of my linear time series regression with 2 lags. The blue line is the training data, the green line is the predicted values, and the yellow line is the actual values. You can see that the model is attempting to account for the large spike in growth directly before the predicted period. It then goes on to correct itself, with an RMSE of 98 and an R2 score of 0.70, making it my most successful model.
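For readers who have not seen a lagged regression before, here is a minimal sketch of the idea using NumPy least squares. The case counts, the train/test split, and the helper name `make_lagged` are all assumptions for illustration; the original model may have been built with a different library and will not match these numbers.

```python
import numpy as np

# Made-up weekly case counts (not the real prison data).
cases = np.array([5, 8, 12, 20, 26, 30, 41, 55, 60, 72, 90, 97], dtype=float)


def make_lagged(series, n_lags=2):
    """Return (X, y) where each row of X holds the n_lags previous values of y."""
    X = np.column_stack(
        [series[n_lags - k : len(series) - k] for k in range(1, n_lags + 1)]
    )
    y = series[n_lags:]
    return X, y


X, y = make_lagged(cases, n_lags=2)

# Keep time order: train on the earlier weeks, test on the later ones.
split = 7
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coefs, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.column_stack([np.ones(len(X_test)), X_test])
predictions = A_test @ coefs

# RMSE is in the same units as the data; R2 compares against predicting the mean.
rmse = float(np.sqrt(np.mean((y_test - predictions) ** 2)))
ss_res = float(np.sum((y_test - predictions) ** 2))
ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
r2 = 1 - ss_res / ss_tot

print(coefs, rmse, r2)
```

Each prediction depends only on the two previous weeks, which is why a spike just before the test period pulls the first forecasts upward.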

My second model, the AutoRegressive Integrated Moving Average (ARIMA), was much more involved with very little return. After testing the undifferenced data with the Augmented Dickey-Fuller test, I determined that the data was already stationary, so I did not have to difference it in the model. I struggled to make the model predict due to some frequency errors (covered in this blog post), but eventually got it to run. With an RMSE of 31, I was excited to have a better performing model! Then I calculated the R2 score and it turned out to be 0, indicating that the model explains none of the variance in the data and does no better than predicting the mean.
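It can seem odd that a model has a reasonable-looking RMSE and an R2 of 0. The sketch below, with invented numbers, shows one way that can happen: a forecast that is simply flat at the mean of the actual values. This illustrates the two metrics and makes no claim about what the ARIMA model actually did.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r2_score(actual, predicted):
    """1 minus the ratio of model error to the error of always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up actual values and a flat forecast equal to their mean.
actual = [100.0, 120.0, 140.0, 160.0]
flat_forecast = [130.0] * len(actual)

print(rmse(actual, flat_forecast))
print(r2_score(actual, flat_forecast))
```

RMSE depends on the scale and spread of the values being predicted, so it is only comparable between models scored on the same test data, while R2 always asks whether the model beats the mean.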

Results & Predictions

I ended up using my first model, which was meant to be the baseline model. I successfully predicted on an unseen date, so we will find out if it was correct!

This project taught me so much about the data science process. I struggled to find a topic and data to use for it. I wanted to balance doing something I was passionate about (issues surrounding mass incarceration and social justice) with something that hadn't been done before, and something that felt useful. I also decided to learn Tableau for this presentation, and created some really fun data visualizations with this project.

Data Visualization

All data is from the second data pull in April 2021.