Stock data analysis and visualization
To manage their portfolios properly, investors need to visualize datasets, find interesting patterns, and extract valuable information such as daily stock returns and risk.
Let’s first read the stock data CSV file. Here’s the list of stocks considered:
- AAPL = Apple Stock
- BA = Boeing
- T = AT&T
- MGM = MGM Resorts International (Hotel Industry)
- AMZN = Amazon
- IBM = IBM
- TSLA = Tesla Motors
- GOOG = Google
- sp500 = US Stock Market (the S&P 500 is a stock market index that measures the performance of 500 large companies listed on U.S. stock exchanges)
Check the list of S&P 500 companies here: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
import pandas as pd
stock_df = pd.read_csv(...path to csv...)
We have data from 2012 through 2020, so we can look for the impact of the COVID pandemic on the stock market. We can ask questions like: ‘How did the pandemic affect Amazon stock?’ or ‘How did the pandemic affect the hospitality industry?’
We know the pandemic hit the hospitality industry hard, and MGM’s stock value suffered. At the same time, the pandemic created a positive market for Amazon: people started shopping online, working online, and so on, which created a favorable environment for tech companies. Boeing’s stock value also suffered, since there was almost no travel during the pandemic.
We can ask many interesting questions, form hypotheses, and validate them against the data. We can discover interesting patterns in the data and study them. By now the importance of data analysis and visualization in machine learning should be clear.
We have a dataset that gives us information about different stocks recorded over time. In the CSV file we have a ‘Date’ column that tells us the order in which the events occurred. What kind of data is this?
What makes data sequential? Or, what do we mean by sequence data?
Sequence data is a collection of events happening in time. Data is considered sequential only if earlier observations tell us something about later observations. In a time series, events are recorded at a regular frequency.
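As a quick illustration (a minimal, self-contained sketch with made-up prices standing in for the CSV), parsing the ‘Date’ column lets pandas treat the frame as a time series recorded at a regular frequency:

```python
import pandas as pd

# Tiny stand-in for stock_df, with made-up prices and a 'Date' column
# recorded once per day.
df = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'AAPL': [75.0, 74.3, 74.9],
})

# Parsing the dates makes the ordering explicit: earlier rows carry
# information about later rows, which is what makes this sequence data.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').set_index('Date')

print(df.index.inferred_freq)  # 'D' -> a regular daily frequency
```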
Exploratory Data Analysis
Whenever we deal with real-world data, always check for null values: the real world is never perfect, and as analysts we must be aware of that.
stock_df.isna().sum()   # count of null values per column
stock_df.info()         # column types and non-null counts
stock_df.describe()     # summary statistics of the data
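If nulls do show up, a common remedy for price series is to carry the last known value forward. A small sketch on a hypothetical frame (the values and the missing entry are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing price, standing in for stock_df.
df = pd.DataFrame({'AAPL': [75.0, np.nan, 74.9],
                   'IBM': [134.0, 135.2, 133.8]})

null_counts = df.isna().sum()  # nulls per column: AAPL has 1, IBM has 0
print(null_counts)

# Forward-fill: each gap takes the most recent known price.
df_filled = df.ffill()
print(int(df_filled.isna().sum().sum()))  # 0 -> no nulls remain
```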
Let’s plot the data and look for interesting patterns
# lets plot the data
stock_df.plot(x='Date', figsize=(20,8));
# Interactive plots with plotly
import plotly.express as px
fig = px.line()
for col in stock_df.columns[1:]:
    fig.add_scatter(x=stock_df['Date'], y=stock_df[col], name=col)
fig.show()
Daily return on a stock
The daily return on a stock measures its day-to-day performance: it compares today’s closing price with the previous day’s closing price. A positive daily return means the stock price appreciated from one day to the next.
How do I calculate Daily return on stock?
- Subtract the opening price from the closing price. Locate the opening price of the stock and the closing price, and find the difference between the two to calculate the price change.
- Multiply the difference by the shares you own. Find the total number of stocks, also known as shares, you own in the company. Take the difference between the opening and closing prices and multiply it by the shares you own to determine exactly how much your stock increased (or decreased) in value that day.
- Divide the daily return by the current stock price and multiply by 100 to get a percentage. If you want the percentage of your stock’s daily return, take the daily return, divide it by the current stock price, and multiply that value by 100.
daily_return = {}
for col in stock_df.columns[1:]:
    stock = stock_df[col].values
    daily_return[col] = [0]  # no previous day for the first observation
    for idx in range(1, len(stock_df)):
        # Difference between today's price and the previous day's price.
        diff = stock[idx] - stock[idx-1]
        # Divide by the current stock price and multiply by 100 to get a percentage.
        percentage = (diff / stock[idx]) * 100
        daily_return[col].append(percentage)
daily_return_df = pd.DataFrame(data=daily_return)
daily_return_df['Date']=stock_df['Date']
daily_return_df.head()
# Let's plot the daily returns
daily_return_df.plot(x='Date', figsize=(15,5));
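For what it’s worth, pandas can express the same idea in one line with `pct_change`. Note one difference: `pct_change` divides by the previous day’s price (the more common convention), while the loop above divides by the current day’s price, so the numbers differ slightly. A self-contained sketch with made-up prices:

```python
import pandas as pd

# Made-up prices chosen for easy mental arithmetic.
prices_df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
                          'AAPL': [100.0, 150.0, 75.0]})

# (today - yesterday) / yesterday, times 100 for a percentage;
# the first row has no previous day, so replace its NaN with 0.
returns_df = prices_df.drop(columns=['Date']).pct_change().fillna(0) * 100
returns_df['Date'] = prices_df['Date']
print(returns_df['AAPL'].tolist())  # [0.0, 50.0, -50.0]
```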
Covariance and Correlation
What is covariance?
Suppose we have data about the number of green apples in five different shops in our area. Let’s use the variables { A, B, C, D, E } to represent the shops.
We can apply statistics to find the mean and standard deviation of this data. The mean tells us the average number of green apples available in our area, and the standard deviation quantifies the variation around the mean, or simply the spread of the data.
Similarly, suppose we have data about the number of red apples from the same five shops { A, B, C, D, E } in our area.
Now that we have data about both green apples and red apples from the five shops, we can look at the data in pairs.
We can ask questions like: do the measurements taken as a pair tell us anything that the individual measurements do not?
Covariance is one way to answer this question.
The main idea behind covariance is that it can classify three kinds of relationships:
- Relationship with a Positive trend
- Relationship with a Negative trend
- No Trend
- A positive covariance indicates a positive trend
- A negative covariance indicates a negative trend
- A covariance of zero indicates no trend
Let’s plot the data in XY-plane, we are looking at data in pairs.
X axis = Green apples data from { A, B, C, D, E}
Y axis = Red apples data from { A, B, C, D, E}
If low values of green apples correspond to low values of red apples and high values of green apples correspond to high values of red apples, what we are seeing is a positive trend: the X-axis and Y-axis values are increasing together.
If low values of green apples correspond to high values of red apples and high values of green apples correspond to low values of red apples, what we are seeing is a negative trend: as the X-axis values increase, the Y-axis values decrease.
Now if for different values on the X-axis the Y-axis value stays the same (or vice versa), what we are seeing is called no trend, because no relationship exists.
Covariance in and of itself is not very useful; we’ll rarely calculate the covariance and call it a day. Instead, covariance is a computational stepping stone to something more interesting, like correlation or PCA.
How do we calculate Covariance ?
Suppose we have genetics data from five different cells
- We first compute the mean of X → x_mean and the mean of Y → y_mean
- Then we subtract the means from each individual observation to obtain x_diff and y_diff
- In the next step we multiply the corresponding x_diff and y_diff values and sum the products
- Finally we divide the sum by the number of observations
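The steps above can be sketched in a few lines (the gene readings are made-up numbers for illustration):

```python
import numpy as np

# Hypothetical readings of gene X and gene Y from five cells.
x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 2.0, 4.0, 6.0, 8.0])

x_mean, y_mean = x.mean(), y.mean()        # step 1: means
x_diff, y_diff = x - x_mean, y - y_mean    # step 2: deviations from the mean
products = x_diff * y_diff                 # step 3: multiply pairwise
cov_xy = products.sum() / len(x)           # step 4: divide the sum by n

print(cov_xy)  # approximately 6.56 -> positive, so the trend is positive

# NumPy's covariance (with ddof=0 for the population formula) agrees.
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])
```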
Why is covariance not very interesting ?
Covariance values are highly susceptible to the scale of the data: if we change the scale, the covariance value changes even though the underlying distribution does not.
Covariance tells us about the trend: a positive trend means a positive slope and a negative trend means a negative slope. It doesn’t tell us anything about the steepness of the slope or the strength of the relationship.
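A quick demonstration of both points with made-up numbers: rescaling the data (say, dollars to cents) multiplies the covariance by the scale factors, while the correlation, which we meet next, is untouched:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

cov = np.cov(x, y)[0, 1]
# Same distribution shape, different units: covariance blows up.
cov_scaled = np.cov(x * 100, y * 100)[0, 1]
print(cov, cov_scaled)  # the scaled covariance is 10,000x larger

# Correlation, by contrast, does not change with the scale.
r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x * 100, y * 100)[0, 1]
assert np.isclose(r, r_scaled)
```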
Correlation ( R )
Correlation gives us a value between -1 and +1
- -1 implies a negative trend, with the slope going downward
- +1 implies a positive trend, with the slope going upward
Correlation also tells us how strong the relationship is. Suppose we model the relationship with a line; if the data points fall close to the line, the relationship is strong.
If the data points fall farther from the line, the relationship is weak.
Let’s plot the apples data in XY-plane, we are looking at data in pairs.
X axis = Green apples data from { A, B, C, D, E}
Y axis = Red apples data from { A, B, C, D, E}
Let’s model the relationship using a straight line.
If the data points are close to the line, then given a value on the X-axis we can predict the corresponding value on the Y-axis within a very narrow range.
This implies a strong relationship.
If the data points are far from the line, then given a value on the X-axis we can only predict the corresponding value on the Y-axis within a wide range of values.
This implies a weak relationship.
Suppose we have genetics data from five different cells:
After collecting data from the cells, suppose we observe a trend: higher values of gene X in a cell correspond to higher values of gene Y in the same cell, and lower values of gene X correspond to lower values of gene Y.
Covariance tells us about the trend — there is a positive trend in this case.
Correlation tells us about strength of the relationship.
What is the strength of relationship between gene X and gene Y in a cell?
Note: all we are saying is that we observed a trend, positive or negative.
The positive trend: low values of X correspond to low values of Y, and high values of X correspond to high values of Y.
The negative trend: low values of X correspond to high values of Y, and high values of X correspond to low values of Y.
The observation suggests a trend, and we can use it to make predictions, that is, educated guesses. We are not saying that low values of X cause the Y values to be low, or anything of the sort. In other words, we are not ruling out the possibility that something else is causing the trend we observe.
Correlation values are not affected by the scale of the data, and we can check whether an observed correlation is due to random chance by using a p-value.
The p-value tells us the probability that randomness alone would produce a relationship this strong. The smaller the p-value, the more confidence we have in the model.
Confidence in the model also scales with the number of data points used to fit the line. More data implies more confidence in the model that fits the data.
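Assuming SciPy is available, `scipy.stats.pearsonr` returns both the correlation and its p-value in one call. A sketch on synthetic data (the "returns" are made up, generated so that y mostly follows x):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic "daily returns" of two stocks that tend to move together.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
y = 0.8 * x + rng.normal(0.0, 0.5, 200)  # mostly driven by x, plus noise

r, p_value = pearsonr(x, y)
print(r, p_value)  # a strong positive correlation with a tiny p-value
```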
Using correlation we can get insight into the strength of a relationship, but with correlation alone we can’t compare two relationships and say that a relationship with correlation = 0.7 is twice as good as one with correlation = 0.5.
For that we need R-squared.
R-squared
With R-squared, interpretation is easier: just square the R values.
R = 0.7, then R-squared = 0.49
R = 0.5, then R-squared = 0.25
Now we can see that a relationship with correlation 0.7 explains about twice as much variation as one with correlation 0.5.
R-squared ranges from 0 to 1.
R-squared compares the variation around the mean line with the variation around the fitted line.
Let’s look at an example:
Suppose:
Variance around the mean = 32 and variance around the fitted line = 6
R-squared = (32 - 6)/32 ≈ 0.81
This means there is 81% less variation around the fitted line than around the mean.
So the model fits the data well, and we can use it to make predictions.
As always, our confidence depends on two things:
- How many samples/observations we have: more data implies more confidence in the model
- The p-value: a small p-value tells us there is a low probability that randomness alone would produce an equally good fit
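Putting the pieces together in code (the x and y values are made up, roughly on the line y = 2x): fit a line, compare the variation around the mean with the variation around the fit, and check that the result matches the squared correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

# Fit a straight line and compute its predictions.
slope, intercept = np.polyfit(x, y, 1)
y_fit = slope * x + intercept

var_mean = np.var(y)         # variation around the mean line
var_fit = np.var(y - y_fit)  # variation around the fitted line

r_squared = (var_mean - var_fit) / var_mean
print(r_squared)  # close to 1: the line explains most of the variation

# For a straight-line fit this equals the squared correlation coefficient.
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
```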
Let’s obtain the correlation matrix of daily returns
# correlation matrix of daily returns
import seaborn as sns

correlation = daily_return_df.drop(columns=['Date']).corr()
sns.heatmap(correlation, annot=True);
Histogram of daily returns
Investors want to reduce the risk they face in the stock market; they want to invest in stocks that give positive returns.
By using a histogram we can get insights into the distribution of a stock’s daily returns.
From that distribution we can find out which stocks are more likely to give us a positive return and which are risky to invest in.
A histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges, or bins.
If you want to use a distribution to approximate your data or a future measurement, a histogram is a good way to justify that decision. Depending on the distribution, we might use a normal curve to approximate the measurements.
The tricky part of using a histogram is choosing the number of bins. We won’t get any helpful information if the number of bins is too small or too large.
Let’s say the bins are too narrow, so that there is a bin for each observation (too many bins): the histogram is not useful, as it tells us nothing.
Now what if the bins are too wide, so that all observations fall into only two bins (too few bins)? It is not useful either, as it only gives us a rough 50-50 split.
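To see the effect of the bin count without any plotting machinery, `np.histogram` on made-up, normally distributed "returns" makes the point:

```python
import numpy as np

# 1,000 made-up daily returns drawn from a bell-shaped distribution.
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.05, scale=1.0, size=1000)

# Too few bins: essentially a 50-50 split that says almost nothing.
counts_2, _ = np.histogram(returns, bins=2)
print(counts_2)

# A moderate number of bins reveals the bell shape: the tallest bin
# sits near the center of the distribution.
counts_30, edges = np.histogram(returns, bins=30)
print(counts_30.argmax())
```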
A negative daily return means we are losing money and a positive one means we are gaining. So if we use a histogram with an appropriate number of bins, we can gain insight into how often we lose money and how often we gain.
We can approximate the daily return data of a stock with a normal curve and make intelligent decisions.
# Interactive plot
import plotly.figure_factory as ff

data = []
for vals in daily_return.values():
    data.append(vals)
fig = ff.create_distplot(data, list(stock_df.columns[1:]))
fig.show()