I recently tried to fit an Autoregressive Integrated Moving Average (ARIMA) model to my time series data, COVID-19 cases in New Jersey prisons. After creating a successful linear time series regression, I moved on to the slightly more complex ARIMA. Step by step, I differenced the data, fit the model, and tried to run predictions. But it kept coming back with the error:
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I could not figure out why. The index held dates in the correct format, since I had used this code to set it:
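import pandas as pd
df['date_column'] = pd.to_datetime(df['date_column'])  # assign the result; to_datetime does not modify in place
df.set_index('date_column', inplace=True)  # use the same column name as above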
With the help of my good friend Stack Overflow, I learned that the datetime index has to have a consistent frequency. For example, if the first record is on March 26, 2020, the next is on April 1, 2020 (6 days later), and the third is on April 10, 2020 (9 days later), then the gaps between Index[0], Index[1], and Index[2] are uneven and the index has no single frequency. This can cause an error down the line in the ARIMA model, as it did for me at the prediction step.
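A quick way to check for this problem is to look at the gaps between consecutive index values. The sketch below uses the three example dates from the paragraph above; the `cases` column and its values are made up for illustration and are not from my dataset.

```python
import pandas as pd

# Hypothetical records on the three example dates (gaps of 6 and 9 days).
dates = pd.to_datetime(["2020-03-26", "2020-04-01", "2020-04-10"])
df = pd.DataFrame({"cases": [3, 5, 8]}, index=dates)

# Time elapsed between consecutive records. A regular series
# would show exactly one unique gap here.
gaps = df.index.to_series().diff().dropna()
print(gaps.unique())

# With uneven gaps, the index carries no frequency and pandas
# cannot infer one either.
print(df.index.freq)
inferred = pd.infer_freq(df.index)
print(inferred)
```

If more than one unique gap shows up, the index is irregular and needs to be put on a fixed frequency before modeling.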
To fix this, I used resampling. For time series data, resampling is like a df.groupby(). It aggregates the data according to a time period, such as year (Y), month (M), or day (D). I used the weekly (W) frequency and was able to aggregate the data to a consistent frequency with the code below:
df = df.resample('W').sum()
After I made sure the data was in the correct format, I chose my target column, parameters, and train/test split. Finally, I was able to make predictions based on my trained model.
I hope this information helps some troubled data scientist or data science student someday. It took hours of research and rerunning models, but at least I now know how to create an accurate and consistent time series dataset for future models!