Data analysis in Python

My days with MATLAB are definitely not over, but there is a clear trend that more and more of the people I work with use open source tools for data analysis and visualization. This has motivated me to test alternative tools and compare them with MATLAB with respect to functionality and performance.

Recently, I've been working with correlation of data from gas sensors. For the SINTSENSE concept, I collect data from the sensors into an InfluxDB 1.x database and plot it using Grafana. I found no functionality for plotting the correlation between sensors, so I worked around this by extracting data from the database every hour, calculating the correlation coefficients and pushing them back into the database. These coefficients were then plotted in Grafana alongside the sensor responses.
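As a minimal sketch of the calculation step in that hourly job, the correlation coefficients between sensors can be computed with pandas. The sensor names and the synthetic readings are made up for illustration, and the database read and write steps are left out:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one hour of sensor readings (hypothetical sensor names)
rng = np.random.default_rng(0)
base = rng.normal(size=60)
sensors = pd.DataFrame({
    "sensor_a": base,
    "sensor_b": 0.8 * base + 0.2 * rng.normal(size=60),  # follows sensor_a closely
    "sensor_c": rng.normal(size=60),                     # independent of the others
})

# Pearson correlation coefficient for every pair of sensors
corr_matrix = sensors.corr()
print(corr_matrix.round(2))

# Flatten to one value per sensor pair, e.g. as fields of a single database point
pairs = {
    f"{a}_{b}": corr_matrix.loc[a, b]
    for i, a in enumerate(corr_matrix.columns)
    for b in corr_matrix.columns[i + 1:]
}
print(pairs)
```

The flattened dictionary is just one possible layout for writing the values back; any naming scheme that the dashboard can query would do.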

The work with SINTSENSE made me want to explore the possibilities of doing correlation analysis on time series data. Calculating a correlation is straightforward, but to calculate a dynamic correlation, the data must be divided into time segments on which the calculations are performed.
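One way to picture the segment idea is to split a time-indexed data frame into fixed segments and compute one coefficient per segment. This is only a sketch on synthetic data; the column names x and y and the one-day segment length are assumptions for the example:

```python
import numpy as np
import pandas as pd

# Synthetic series with 5-minute sampling over four days (made-up data)
index = pd.date_range("2021-01-01", periods=4 * 288, freq="5min")
rng = np.random.default_rng(1)
x = rng.normal(size=len(index))
y = -x + 0.5 * rng.normal(size=len(index))  # anti-correlated with x, plus noise
data = pd.DataFrame({"x": x, "y": y}, index=index)

# Split into daily segments and compute one correlation coefficient per segment
daily_corr = pd.Series({
    day: segment["x"].corr(segment["y"])
    for day, segment in data.groupby(pd.Grouper(freq="1D"))
})
print(daily_corr)
```

Fixed segments give one value per day; a rolling window gives a smoother curve with one value per sample.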

To test how this could be done in Python, I decided to grab some meteorological data from my Netatmo stations. For outdoor temperature and pressure, I would expect to see a correlation.

To plot the temperature data, I used pandas to import the data as data frames:

ot = pd.read_csv('c:/temp/netatmo_ot.csv', sep=';', header=2)
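To illustrate what header=2 does, here is a small sketch that reads from an in-memory string instead of a file. The two leading lines and the values are invented; only the ';' separator and the Timestamp and temperature column names come from the post:

```python
import io
import pandas as pd

# Hypothetical file content: two descriptive lines, then the column header
sample = io.StringIO(
    "Name;Outdoor\n"
    "Location;Somewhere\n"
    "Timestamp;temperature\n"
    "1610000000;-3.2\n"
    "1610000300;-3.4\n"
)

# header=2 uses the third line (index 2) as column names and skips the lines above it
df = pd.read_csv(sample, sep=";", header=2)
print(df.columns.tolist())
print(df)
```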

Plotting the data with matplotlib is straightforward. Plotting was very slow at first, but converting the timestamp to datetime manually sped it up considerably:

plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'], color='lightgray', marker='o')
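The conversion itself can be tried in isolation. In this sketch the timestamps are made-up epoch values in seconds, spaced 5 minutes apart:

```python
import pandas as pd

# Epoch timestamps in seconds, 5 minutes apart (made-up values)
raw = pd.Series([0, 300, 600])

# unit='s' tells pandas that the numbers are seconds since 1970-01-01
times = pd.to_datetime(raw, unit="s")
print(times)
```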


Smoothing of the data was done quite elegantly with the rolling functionality. Specifying center=True was necessary to avoid a time offset:

plt.plot(pd.to_datetime(ot['Timestamp'], unit='s'), ot['temperature'].rolling(10, center=True).mean(), 'r-')
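The effect of center=True is easiest to see on a simple ramp. This is a small sketch with a window of 3 points, chosen only to keep the output short:

```python
import pandas as pd

ramp = pd.Series(range(10), dtype=float)

trailing = ramp.rolling(3).mean()               # window ends at the current point
centered = ramp.rolling(3, center=True).mean()  # window is centered on the current point

print(pd.DataFrame({"ramp": ramp, "trailing": trailing, "centered": centered}))
```

The trailing mean lags the ramp by one point, while the centered mean follows it. The price is that the centered version is undefined at both ends instead of only at the start.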

Pressure data was treated in the same way.


In order to calculate the correlation, the rolling function was combined with the corr function in a very elegant way:

ptcorr = ot['temperature'].rolling(1000, center=True).corr(ip['pressure'])
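The same call can be tried on synthetic data where the answer is known in advance. In this sketch the two series are made up, and the window of 50 points is chosen only to keep the example small:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
window = 50

# Made-up series: pressure_demo is anti-correlated with temperature_demo, plus noise
temperature_demo = pd.Series(rng.normal(size=n))
pressure_demo = -temperature_demo + 0.3 * pd.Series(rng.normal(size=n))

# One correlation coefficient per sample, computed over a centered window
rolling_corr = temperature_demo.rolling(window, center=True).corr(pressure_demo)

valid = rolling_corr.dropna()
print(len(rolling_corr), len(valid))
print(valid.describe())
```

The result has the same length as the input, but only the points with a full window around them carry a value.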

With center=True, the rolling function leaves about 1000/2 points undefined (NaN) at each end of the series. The comparison was performed between data frames of length 8955 and 8942 and resulted in a series of length 8955, and therefore the datetime column of the longest data frame was used. With 5 min time resolution, 13 points amount to a time offset of about an hour. Over 30 days, this is acceptable for this purpose.
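The length behaviour can be reproduced on a small scale: when two series of different length are combined, pandas pairs the rows by index label, not by time. The sketch below uses made-up data frames (ot_demo and ip_demo are hypothetical names) and also shows one possible alternative, pairing the readings by nearest timestamp with merge_asof before correlating. This is not what was done above, just an option if the offset matters:

```python
import numpy as np
import pandas as pd

# Two made-up stations sampled every 300 s; the second one misses three readings
ts = np.arange(0, 20 * 300, 300)
ot_demo = pd.DataFrame({"Timestamp": ts,
                        "temperature": np.linspace(5.0, -5.0, 20)})
ip_demo = pd.DataFrame({"Timestamp": np.delete(ts, [4, 9, 14]),
                        "pressure": np.linspace(1000.0, 1020.0, 17)})

# Pairing by row position: the result is as long as the longer series
by_position = ot_demo["temperature"].rolling(5, center=True).corr(ip_demo["pressure"])
print(len(ot_demo), len(ip_demo), len(by_position))

# Alternative: pair each temperature reading with the nearest pressure reading in time
merged = pd.merge_asof(ot_demo, ip_demo, on="Timestamp",
                       direction="nearest", tolerance=300)
by_time = merged["temperature"].rolling(5, center=True).corr(merged["pressure"])
print(merged.head())
```

With merge_asof both inputs must be sorted by the key column, and the tolerance is given in the same unit as the key, here seconds.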

To get a sound time resolution, the rolling interval was increased to 1000 points, which is about 3.5 days at 5 min resolution. Plotting was done as before.
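For reference, the time span covered by a window of 1000 points can be worked out directly, assuming the 5 min sampling interval mentioned above holds throughout:

```python
import pandas as pd

sample_interval = pd.Timedelta(minutes=5)  # assumed constant sampling interval
window_points = 1000

# Total time covered by one rolling window
window_span = sample_interval * window_points
print(window_span)
```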


The plot shows that there is a negative correlation between temperature and pressure. This is expected from the well-known combination of cold weather and high pressure in winter. The negative correlation is stronger than -0.8, while the occasional positive correlation is much weaker.


Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 12 19:54:54 2021

@author: taarhaug
"""

import pandas as pd
import matplotlib.pyplot as plt

# Read data from CSV files with ';' separator. The first two lines are
# skipped and the third line is used as the column header.
ot = pd.read_csv('c:/temp/netatmo_ot.csv', sep=';', header=2)
ip = pd.read_csv('c:/temp/netatmo_ip.csv', sep=';', header=2)

# Plotting was very slow, so convert the timestamps to datetime manually
ot_time = pd.to_datetime(ot['Timestamp'], unit='s')
ip_time = pd.to_datetime(ip['Timestamp'], unit='s')

# Temperature plot: raw data and centered rolling means over 10 and 100 points
plot1 = plt.figure(1)
plt.plot(ot_time, ot['temperature'], color='lightgray', marker='o')
plt.plot(ot_time, ot['temperature'].rolling(10, center=True).mean(), 'r-')
plt.plot(ot_time, ot['temperature'].rolling(100, center=True).mean(), 'k-')
plt.grid(True)
plt.title('Netatmo temperature data')
plt.ylabel('Temperature /°C')
plt.show()

# Pressure plot: raw data and centered rolling means over 10 and 100 points
plot2 = plt.figure(2)
plt.plot(ip_time, ip['pressure'], color='lightgray', marker='o')
plt.plot(ip_time, ip['pressure'].rolling(10, center=True).mean(), 'r-')
plt.plot(ip_time, ip['pressure'].rolling(100, center=True).mean(), 'k-')
plt.grid(True)
plt.title('Netatmo pressure data')
plt.ylabel('Pressure /mbar')
plt.show()

# Rolling correlation between temperature and pressure over 1000 points
ptcorr = ot['temperature'].rolling(1000, center=True).corr(ip['pressure'])

# Plot the correlation coefficient against the pressure timestamps
plot3 = plt.figure(3)
plt.plot(ip_time, ptcorr)
plt.grid(True)
plt.ylabel('Corrcoef')
plt.show()


