Timeseries Decompositions with Web Activity Data (Python)

Part I – Install the CLI Tool

If you have not already done so, please follow the installation guide for the CLI tool. The guide also helps you set up a directory/folder structure for a clean working environment. While this step is not mandatory, it will make some of the steps below easier to follow; the code below assumes that directory structure.

Once the CLI tool is installed and the directory structure is set up, move on to the next step to retrieve the data.

 

Part II – Get Data from the API

Navigate to the directory containing the unpacked contents of the installation zip file, and move into the ‘bin’ folder. This folder contains the Python scripts we will use to retrieve the data. Enter the following command:

cd ~/path-to-project-folder/spd-cli-0.2.0/bin

The next command retrieves the data, which we will collect in a CSV file named “traffic_domains_daily_google.csv”. In the command below, replace “[Token]” with a valid API key (contact your 7Park sales representative to request one). The example uses the following values:

Domain: google.com

Country Code: US

Platform: PC

Date Range: 2013-01-01 to 2017-07-23

python get_traffic_domain_usage_daily.py --domains google.com --country_codes US --platform PC --date_from 2013-01-01 --date_to 2017-07-23 --token [Token] --format csv > ../../output/traffic_domains_daily_google.csv
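Before moving on, you can sanity-check the output with a quick look at the first few rows of the CSV file (assuming a macOS/Linux shell; the path is relative to the ‘bin’ folder we are in):

head -5 ../../output/traffic_domains_daily_google.csv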

 

Part III – Setting Up Python Tools to Start Exploring the Data

Now that we have the data in a CSV file, let’s do some analysis.

To interact with the data, we will be using Python in this example, and more specifically Jupyter notebooks. So, let us check whether Jupyter Notebook is installed in our Python environment.

The command below lists the packages installed in your Python environment; check whether jupyter appears in the list.

pip list
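On macOS/Linux you can filter the output to avoid scrolling through the full list:

pip list | grep -i jupyter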

If it appears, skip the following command. If not, install it:

pip install jupyter

With Jupyter installed, let’s navigate to the notebooks folder we created in Part I and launch Jupyter Notebook:

cd ../../notebooks/
jupyter notebook

The command above opens the contents of the current directory (/notebooks) in your default browser (the folder should be empty right now). Click ‘New’ in the top right corner and choose ‘root’ or ‘python2’ from the dropdown menu to launch a new Python notebook. A new browser tab will open with an empty Jupyter notebook, and we can start writing code and exploring the data.

Note:

To start our exploration, we will be taking advantage of Python’s libraries/packages (just as we did with Jupyter Notebook). So, our first step will be to import the relevant packages.

If you do not have a package, simply navigate to the terminal window (or open a new terminal window) and run

pip install [package name]

For example, if you do not have pandas installed, type the following in the terminal window:

pip install pandas

If the installation worked without errors, navigate back to the Jupyter notebook in the browser and rerun the import command. This time it should work.

 

Part IV – Working with the Data

Now that you are in a Jupyter notebook in your browser, we can start writing some Python code to work with the data.

In this notebook, we will be:

  • loading the data
  • viewing it
  • making transformations on data types (especially date columns)
  • aggregating from daily timeseries to weekly timeseries
  • plotting the timeseries
  • applying seasonal decomposition to the weekly and monthly timeseries

To get started, follow the commands shown below.

 


 

Python code begins here:

Import relevant libraries.

If a package is not installed, refer to the guidelines above to install it.

 

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import datetime
from scipy.stats import zscore
import statsmodels.api as sm
from pylab import rcParams

plt.style.use('ggplot')
%matplotlib inline

 


 

Reading and transforming the 7Park data from our output above

Read the data in and see what it looks like. In this step, we also prepend a ‘7P_’ prefix to the column names to mark them as 7Park data columns. This matters little here, but when merging with datasets from other sources, similar column names can often cause confusion.

 

# Read in the data
data = pd.read_csv('../output/traffic_domains_daily_google.csv')

# Add 7P prefix to all 7Park data DataFrames to avoid confusion
data.columns = ['7P_' + s for s in data.columns]

# Let's look at what this data looks like
display('Data:',data.head())
display('Data Types:', data.dtypes)

 

Data:

7P_domain 7P_company 7P_symbol 7P_country_code 7P_date 7P_visitors-unique 7P_visitors-total
0 google.com Google NaN US 2014-01-01 50.0885 1633.0263
1 google.com Google NaN US 2014-01-02 53.2880 1847.0312
2 google.com Google NaN US 2014-01-03 54.3325 1843.2766
3 google.com Google NaN US 2014-01-04 55.0882 1795.5004
4 google.com Google NaN US 2014-01-05 57.0284 1939.4533

 

Data Types:

7P_domain              object
7P_company             object
7P_symbol             float64
7P_country_code        object
7P_date                object
7P_visitors-unique    float64
7P_visitors-total     float64
dtype: object

 

As you will see above, the datatype of the 7P_date column is ‘object’. To perform any kind of date manipulation, we need pandas to understand that it is a date column. So, in the next chunk of code, we change the data type of the date column from object to datetime. In addition, to ensure any missing days of data do not cause problems later, we create a continuous date series and merge our dataframe onto it. Lastly, we add an ISO week column to the dataframe so that we can eventually aggregate the data to a weekly timeseries.

 

# change date format from object to datetime format
data['7P_date_reformatted'] = pd.to_datetime(data['7P_date'])
# drop the 'object' type column as it is now redundant
data.drop('7P_date', axis=1, inplace=True)

# make a continuous dates dataframe to merge with data above to ensure missing dates do not impact future calculations
# get bounds for this date range from the data
min_date = data['7P_date_reformatted'].min()
max_date = data['7P_date_reformatted'].max()
# calculate the number of days between the min and max date
numdays = (max_date - min_date).days
# create a list of dates, inclusive of the max date
date_list = [min_date + datetime.timedelta(days=x) for x in range(numdays + 1)]
# put the list of dates in a dictionary
date_dict = {'7P_date': date_list}
# convert dictionary object to dataframe by feeding it to Pandas DataFrame() function
df_dates = pd.DataFrame(date_dict)
# add ISO Week column to the dates dataframe
df_dates['7P_iso_week'] = df_dates['7P_date'].apply(lambda x: str(x.isocalendar()[0]) + '-' + str(x.isocalendar()[1]))

# merge the dates column with the data dataframe
data = df_dates.merge(data, how = 'left', left_on = '7P_date',right_on = '7P_date_reformatted')
# drop redundant 7P_date_reformatted column
data.drop('7P_date_reformatted', axis = 1, inplace = True)


# display the data to see
display(data.head(5))

 

7P_date 7P_iso_week 7P_domain 7P_company 7P_symbol 7P_country_code 7P_visitors-unique 7P_visitors-total
0 2014-01-01 2014-1 google.com Google NaN US 50.0885 1633.0263
1 2014-01-02 2014-1 google.com Google NaN US 53.2880 1847.0312
2 2014-01-03 2014-1 google.com Google NaN US 54.3325 1843.2766
3 2014-01-04 2014-1 google.com Google NaN US 55.0882 1795.5004
4 2014-01-05 2014-1 google.com Google NaN US 57.0284 1939.4533

 

Now that the dataframe is ready for some transformations, in this step we aggregate the series by week to produce a weekly timeseries. Then, we scale/normalize the weekly dataframe by calculating z-scores for each numeric column. Finally, we display a small snippet of the dataframe to see what it looks like.

 

# define grouping columns
grp_cols = ['7P_domain','7P_country_code']
grp_cols_wk = grp_cols + ['7P_iso_week']

# aggregate daily data to weekly
data_wk = data.groupby(grp_cols_wk, as_index = False).agg({'7P_visitors-unique' : np.sum, 
                                                           '7P_visitors-total': np.sum,
                                                          '7P_date': max})

# normalize the visitor columns by calculating z-scores on each timeseries
zscore_cols_wk = ['7P_visitors-unique','7P_visitors-total']
for col in zscore_cols_wk:
    data_wk[col + '_z'] = (data_wk[col] - data_wk[col].mean())/data_wk[col].std(ddof=0)

# sort values in the dataframe by date
data_wk.sort_values(by='7P_date', inplace=True)
# set ISO week as the index for the dataframe
data_wk.set_index('7P_iso_week', inplace=True)
# drop the incomplete first week (the daily data starts mid-week)
data_wk.drop('2014-1', inplace=True)

# Look at what this data looks like:
display('Weekly Data:',data_wk.head(5))

 

Weekly Data:

7P_domain 7P_country_code 7P_date 7P_visitors-unique 7P_visitors-total 7P_visitors-unique_z 7P_visitors-total_z
7P_iso_week
2014-2 google.com US 2014-01-12 449.0139 16880.2130 -0.494805 -1.284269
2014-3 google.com US 2014-01-19 468.5534 17083.2584 0.248945 -1.242443
2014-4 google.com US 2014-01-26 475.2622 18929.3030 0.504308 -0.862164
2014-5 google.com US 2014-02-02 471.0727 18804.5464 0.344840 -0.887864
2014-6 google.com US 2014-02-09 471.0785 18898.7584 0.345060 -0.868456

 


 

Plotting the weekly timeseries

 

# plot this data (DataFrame.plot creates its own figure, so no plt.figure() call is needed)
ax = data_wk[['7P_visitors-unique_z','7P_visitors-total_z']].plot(figsize=(20,8), title='Weekly Sum of Metrics, Normalized')
ax.set_xlabel("ISO Year and Week")
ax.set_ylabel("Index")
ax.legend(labels=['Unique Visitors','Total Visitors'])

 

[Plot: weekly normalized unique and total visitors, indexed by ISO year and week]

 

 

Seasonal Decomposition of Series

In this section, we decompose the unique-visitors timeseries into three components: trend, seasonality, and residuals. For this, we will be using ‘seasonal_decompose’ from the statsmodels library. By default it fits an additive model, i.e. observed = trend + seasonal + residual. We apply this decomposition to a weekly aggregated timeseries and a monthly aggregated timeseries, then plot the three components.

Preparing the Data

First, however, we prepare the data for seasonal decomposition. We use the daily timeseries loaded initially and focus our decomposition efforts on the ‘7P_visitors-unique’ column.

 

# subset the daily data to the date and visitors-unique columns
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the subset)
dta = data[['7P_date','7P_visitors-unique']].copy()
# fill each missing value with the same value as its predecessor in the DataFrame
dta = dta.fillna(method='pad')
# set the date column as the index of this DataFrame
dta.set_index('7P_date', inplace=True)

 

Weekly Aggregated Timeseries – Seasonal Decomposition

 

# we resample the data to an aggregated weekly series
dta_w = dta.resample('W').sum()
# we apply the seasonal decomposition module to the weekly data
res = sm.tsa.seasonal_decompose(dta_w)
# set the figure size before plotting, then plot the decomposition
rcParams['figure.figsize'] = 15, 8
figure = res.plot()
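If you want to work with the components directly rather than just plot them, the result object exposes them as attributes. A minimal sketch (note that the trend and residual series are NaN-padded at the edges by the underlying moving average):

# access the individual components of the weekly decomposition
trend = res.trend
seasonal = res.seasonal
residual = res.resid
# the edges are NaN, so drop them before inspecting
display('Trend (first non-missing values):', trend.dropna().head())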

 

 

Monthly Aggregated Timeseries – Seasonal Decomposition

 

# we resample the data to an aggregated monthly series
dta_m = dta.resample('M').sum()
# we apply the seasonal decomposition module to the monthly data
res = sm.tsa.seasonal_decompose(dta_m)
# set the figure size before plotting, then plot the decomposition
rcParams['figure.figsize'] = 15, 8
figure = res.plot()
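The residual component is also a natural place to look for outliers. As a closing example, here is a minimal sketch that reuses the zscore function imported earlier to flag months whose residual is unusually large (the |z| > 2 threshold is an arbitrary choice for illustration):

# flag months whose residual deviates strongly from its typical range
resid = res.resid.dropna()
resid_z = zscore(resid.values.ravel())
anomalies = resid[np.abs(resid_z) > 2]
display('Months with unusually large residuals:', anomalies)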