Blog

A Fix for My ARIMA: Frequency in Time Series Data

Meryl Marie
Apr 19, 2021
2 min read

I recently attempted an Autoregressive Integrated Moving Average (ARIMA) model for my time series data: COVID-19 cases in New Jersey prisons. After creating a successful linear time series regression, I moved on to the slightly more complex ARIMA. Step by step, I differenced the data, fit the model, and tried to run predictions. But it kept coming back with the error:

KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'

I could not figure out why, because the index referenced a date and was in the correct format. I had used this code to reindex:

	import pandas as pd
	df['date_column'] = pd.to_datetime(df['date_column'])  # assign the parsed dates back to the column
	df.set_index('date_column', inplace=True)

With the help of my good friend Stack Overflow, I learned that the datetime index has to have a consistent frequency. For example, if the first record is on March 26, 2020, the next record is on April 1, 2020 (6 days later), and the third is on April 10, 2020 (9 days later), then the time elapsed between Index[0], Index[1], and Index[2] is not constant, so the index has no single frequency. This can cause an error down the line in the ARIMA model.
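Here is a minimal sketch of how to spot the problem, using the three example dates above. The case counts and the variable names are made up for illustration; they are not from my dataset.

```python
import pandas as pd

# Hypothetical case counts on the three irregular dates from the example.
dates = pd.to_datetime(["2020-03-26", "2020-04-01", "2020-04-10"])
cases = pd.Series([5, 12, 30], index=dates, name="new_cases")

# Days elapsed between consecutive records: 6, then 9.
gaps = cases.index.to_series().diff().dropna().dt.days.tolist()
print(gaps)

# The index carries no frequency, and pandas cannot infer one either.
print(cases.index.freq)
inferred = pd.infer_freq(cases.index)
print(inferred)
```

If both of the last two values come back as None, the index is a list of dates rather than a regular time grid, which is worth fixing before modeling.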

To fix this, I used resampling. For time series data, resampling is like a df.groupby(). It aggregates the data according to a time period, such as year (Y), month (M), or day (D). I used the weekly (W) frequency and was able to aggregate the data to a consistent frequency with the code below:

df = df.resample('W').sum()
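To see what that one line does, here is a small sketch on made-up data. The column name new_cases and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical irregular records, including a gap with no data in one week.
dates = pd.to_datetime(["2020-03-26", "2020-04-01", "2020-04-10", "2020-04-24"])
df = pd.DataFrame({"new_cases": [5, 12, 30, 8]}, index=dates)

# Aggregate into weekly bins; by default each bin is labeled by the Sunday that ends it.
weekly = df.resample("W").sum()

print(weekly.index.freqstr)  # the index now carries an explicit weekly frequency
print(weekly)
```

One thing to watch: with .sum(), a week that has no records shows up as 0 rather than as missing. That may or may not be what you want, depending on whether "no record" really means "no cases" in your data.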

After I made sure the data was in the correct format, I chose my target column, parameters, and train/test split. Finally, I was able to make predictions based on my trained model.

I hope this information helps some troubled data scientist or data science student someday. It took hours of research and rerunning models, but at least I now know how to create an accurate and consistent time series data set for future models!

Data Ethics in the Wild: IRL Examples

Meryl Marie
Apr 18, 2021
2 min read

In a nutshell, the 2015 movie "Spotlight" is "the true story of how the Boston Globe uncovered the massive scandal of child molestation and cover-up within the local Catholic Archdiocese..."

I watched this movie around the same time we had our "Data Ethics" lecture in the General Assembly Data Science Immersive course. There are many documentaries, books, and articles about the ethics of using concepts like machine learning and artificial intelligence. For example, some US judges use an algorithm to predict whether a defendant is likely to reoffend. They then use that score to inform the sentence they hand down. It is not a surprise that these scores end up biased. The algorithms are built on data that is inherently biased, as our criminal justice system has historically and unfairly targeted African Americans. In 2014 Eric Holder, then US Attorney General, warned that these algorithms may interfere with "individualized and equal justice." Making assumptions based on statistics without contextual information has proven dangerous.

So what does this have to do with the movie "Spotlight"? *Spoilers ahead*

During the movie (which, again, is a true story) the investigative team at the Boston Globe interviews a former priest who claims that 6% of all priests molest minors. That means in Boston, where 1,500 priests work and reside, 90 priests could potentially be predatory towards children. Operating under this assumption, the team accesses church directories, records from the Catholic Church that show how the church moved priests around to different parishes. Some priests would inexplicably move somewhere new from one year to another, and sometimes they were "unassigned" or "on sick-leave". These designations became a pattern among certain priests, and the team created a list of names to investigate.

The team of journalists operates under two assumptions that appear tenuous on the surface. The first is the claim that 6% of priests are sexual predators, which gives them a working figure of 90 priests to look for. The second is that priests who move from parish to parish, or who are listed as "sick" or on leave, are part of a broader conspiracy to cover up the sexual abuse and avoid scandal and lawsuits.

In the end, the team of journalists was correct. They published evidence of cover-ups involving 75 priests in the city of Boston. While in this case the pattern was uncovered and successfully investigated, the story shows that without proper context, statistics and numbers are not enough. The team went through a relentless process of pulling court records and interviewing those connected. Had they simply published a list of names that seemed suspect, their journalistic integrity would have been in danger.

As data scientists, we must remember that each record in a spreadsheet, dataframe, database, or dataset is a separate case. Whatever we investigate, whether it is temperature, prison populations, crime rates, or vaccine data, we must remember that there is qualitative data that adds context and important details. It is dangerous to be enveloped in numbers all day at your desk and assume that when you leave work the world reflects the black and white of a computer screen. We mustn't forget the grey areas, and we can find inspiration in the journalistic integrity of the team at the Boston Globe, who did their due diligence to find justice.

Data Science: Recommended Books for the Plebeian

Meryl Marie
Apr 18, 2021
3 min read

I am finishing up my Data Science Immersive through General Assembly this week. First of all, what a wild ride. I am still not positive how I worked through it all - but I am on my way to finishing. Second of all, I am writing this post 2 days before our final is due at 1AM on a Saturday. Do with that information what you will, but if you are trusting me enough to read my words, I ask you to trust me enough to believe my book recommendations for the data science n00b, the pleb, the beginner.

I wanted to create a list of helpful data science (DS, for those in the biz) books that will not only help you understand the technical aspects and coding, but also show you the inspiring part of data science: the creative ways to employ DS in the real world.

Without Further Ado:

1) My first recommendation is "Everybody Lies" by Seth Stephens-Davidowitz. This book is an exciting exploration of publicly available data from Google searches, and compares those searches to survey answers. One premise for his methodology is that people will google their true thoughts. Google has become our therapist, our doctor, our dictionary, and more. Because of this, it is a goldmine of insight into people's real thoughts - thoughts about sex, love, race, politics, and more. This book gave me inspiration before starting my class. Now that I am at the tail-end, I am reviewing concepts with an entirely new point of view. I highly recommend this book to see how data is used IRL!

2) My next recommendation is "Naked Statistics: Stripping the Dread from Data" by Charles Wheelan. This book is an awesome, simple review of common statistical concepts. Something great about the DS immersive environment is that you will explore complex statistical models. Terms like "Machine Learning techniques" or "Principal Component Analysis" or "Natural Language Processing" sound super intelligent and are great for resumes and cocktail parties. However, while I was learning these techniques, I sort of forgot the basics. Standard deviation, correlation, and probability are all super important for the entire course. I enjoyed reading this book to keep those concepts fresh in my mind.

3) My favorite instructor recommended "Build a Career in Data Science" by Emily Robinson and Jacqueline Nolis. It is a catchall when it comes to working in data science. I struggled a lot in the beginning of the course with the "data science process." I asked myself: "What are the steps? When do I do them? I have to go back to the first step...is that normal?" First of all - yes, the DS process is convoluted at times. But this book helps spell it out with a chapter titled "Making an Effective Analysis". It also has interviews with data scientists in the wild™️, tips on interviewing, resume-building, accepting an offer, becoming productive at your job...and more. I will be referring back to this book throughout my career. And it's written by women! #grlpwr.

4) One of the more put-together people in my class recommended "Learn Python the Hard Way" by Zed Shaw. I would literally trust this colleague with my life, so I trust that this is a helpful book. I have a funny story about this. A long time ago I asked on social media for recommendations on learning SQL, another coding language I was familiar with early on in my career. Someone wrote "Learn SQL the hard way" on my post, and I thought he was being really rude and told him that I didn't need his sarcasm. Someone else had to explain that it is a series of reference books for learning various coding languages. I haven't read them, but I assume they are helpful in learning coding, as they have now been recommended to me twice.

5) My final book recommendation is LITERALLY ANY BOOK that will keep you sane. Whichever DS course you end up completing, at one point or another you will sigh and cry at the thought of thinking about statistics for one more second. If this is the case, I recommend any book in the "Harry Potter" series, the label of your favorite ice cream pint, or maybe you can just sit and stare at the wall for a few minutes.
