After several months of continuous writing, I have published a good number of blog posts, and I feel it is time to organize them in one place. I cannot express how grateful I am to those who have viewed, read, clapped for, and responded to my articles. Watching the number of followers grow from zero to three digits has encouraged and inspired me to keep producing meaningful content. This article serves as a list of all my blog posts and will be kept updated as I write more.
Recently I became obsessed with a Japanese TV show. I found myself constantly checking Twitter, Instagram, and a Chinese app called Douban for updates and discussions about the show. In the meantime, I ran into an introductory article about the Python library Twint, which makes gathering Twitter data very convenient. Since waiting for a new episode to come out every week is torture, I decided to spend the waiting time exploring Twint and deriving some insights about the show from Twitter.
This article will discuss the use of twint, including how to install twint, how to look up user information, and how to search historical tweets under given conditions. After gathering the data, I will use the Pandas library to clean the data and derive insights. I hope this article can get you started with twint.
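To give a taste of what the article covers, below is a minimal sketch of searching historical tweets with twint and loading the results into Pandas. The search keyword, date range, and language here are invented placeholders; the article walks through the real configuration.

```python
# Minimal twint sketch: search historical tweets for a keyword
# within a date range and load the results into a Pandas DataFrame.
# The keyword, dates, and language are invented placeholders.
import twint

c = twint.Config()
c.Search = "show name"       # hypothetical search keyword
c.Since = "2020-10-01"       # start date (YYYY-MM-DD)
c.Until = "2020-12-01"       # end date
c.Lang = "ja"                # assume Japanese-language tweets
c.Pandas = True              # keep results for Pandas access
c.Hide_output = True         # suppress console printing

twint.run.Search(c)
tweets_df = twint.storage.panda.Tweets_df  # results as a DataFrame
print(tweets_df[["date", "username", "tweet"]].head())
```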
Most importantly, if you have some topics that you are very interested in exploring using data science skills, I hope my article can inspire you to structure your thoughts and guide you towards a starting point. …
It is estimated that 80% of the world’s data is unstructured, so deriving information from unstructured data is an essential part of data analysis. Text mining is the process of deriving valuable insights from unstructured text data, and sentiment analysis is one application of text mining. It uses natural language processing and machine learning techniques to understand and classify subjective emotions in text data. In business settings, sentiment analysis is widely used in understanding customer reviews, detecting spam emails, etc. This article is the first part of a tutorial that introduces the specific techniques used to conduct sentiment analysis with Python. To better illustrate the procedure, I will use one of my projects as an example, where I conduct news sentiment analysis on WTI crude oil futures prices. …
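The tutorial's own pipeline is covered in the article itself; purely as a minimal sketch of what scoring text sentiment in Python can look like, here is an example using NLTK's VADER analyzer on an invented headline.

```python
# A minimal sentiment-scoring sketch using NLTK's VADER analyzer.
# This is an illustration, not the article's pipeline; the headline
# below is invented.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
headline = "Crude oil prices surge as supply cuts exceed expectations"
scores = sia.polarity_scores(headline)

# 'compound' ranges from -1 (most negative) to +1 (most positive)
print(scores)
```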
In April 2020, I attended a Bootcamp and completed the intensive 8-week data science training program virtually. Now when I look back, these two months definitely deserve to be the highlight of my 2020. In this article, I want to share my experience attending the Bootcamp and discuss how I benefited in seven aspects, hoping to give some insights to those considering attending one.
As data science positions become more and more popular worldwide, the number of data science Bootcamps keeps rising. A Bootcamp prepares candidates with data science training and helps them work on a data science project to showcase during interviews. …
Statistical inference is the process of making reasonable guesses about a population's distribution and parameters given the observed data. Conducting hypothesis tests and constructing confidence intervals are two examples of statistical inference. Hypothesis testing is the process of calculating the probability of observing the sample statistics given that the null hypothesis is true. By comparing that probability (the P-value) with the significance level (ɑ), we make reasonable guesses about the population parameters from which the sample is taken. With a similar process, we can calculate a confidence interval at a certain confidence level (1-ɑ). A confidence interval is an interval estimate for a population parameter: the point estimate plus or minus the critical value times the sample standard error. …
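As a quick illustration of the confidence-interval formula above, here is a short sketch computing a 95% t-based confidence interval for a sample mean with SciPy; the data are simulated purely for illustration.

```python
# Sketch: a 95% t-based confidence interval for a sample mean,
# i.e. point estimate ± critical value × standard error.
# The data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=50)  # simulated sample

mean = sample.mean()    # point estimate
se = stats.sem(sample)  # sample standard error
lower, upper = stats.t.interval(0.95, df=len(sample) - 1,
                                loc=mean, scale=se)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```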
California Governor Gavin Newsom recently announced new stay-at-home orders to cope with the increasing number of confirmed coronavirus cases. Following the curfew order issued a few weeks ago, residents of Bay Area regions with ICU capacity below 15% are now advised to stay at home until January 4th, 2021. The Bay Area's economic activities are widely affected, with indoor capacity adjusted at restaurants, shopping centers, etc., and outdoor activities limited.
This is not the first time Bay Area residents have experienced this. In March 2020, shelter-in-place orders were issued to cope with the first wave of the worldwide pandemic. Many businesses, especially small, private ones, were impaired, and many employees, especially those in the service industries, were laid off. Given the enormous economic cost, we need to ask one question: did the shelter-in-place orders help prevent the spread of the virus? This article will not establish the causal links here, but will use data visualization to show the correlation between shelter-in-place orders and confirmed coronavirus cases. …
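The article's actual charts are not reproduced here, but as a minimal sketch of the kind of visualization it describes, here is a matplotlib example plotting case counts over time with an order date marked; both the counts and the date below are invented placeholders.

```python
# Sketch: cumulative confirmed cases over time, with the effective
# date of a shelter-in-place order marked. The counts and the order
# date below are invented placeholders, not real data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range("2020-03-01", periods=60, freq="D")
daily = np.random.default_rng(1).poisson(lam=30, size=60)  # fake daily counts
cases = daily.cumsum()

fig, ax = plt.subplots()
ax.plot(dates, cases, label="Cumulative confirmed cases (invented)")
ax.axvline(pd.Timestamp("2020-03-17"), color="red", linestyle="--",
           label="Shelter-in-place order (placeholder date)")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
plt.show()
```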
In the various projects where I have applied NLP techniques, I have always dealt with text data in English. What should we do when the text data are not in English? This article will discuss how I derived insights from tweets in a foreign language by analyzing the universal language: Emojis🎈.
Recently I started a for-fun project analyzing Twitter posts about a Japanese show I am watching. In my previous posts, I discussed using the Twint library to gather all show-related tweets, along with some analysis of the tweets and tweet-related actions such as the numbers of replies, retweets, and likes. The show has broadcast seven episodes in total, and I have gathered over 222k show-related posts. …
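The article's own emoji analysis is not reproduced here, but as a minimal sketch of the idea, here is one way to pull emojis out of tweet text with a regular expression over common emoji Unicode ranges; the sample tweets are invented.

```python
# Sketch: count emoji frequencies in tweet text using a regex over
# common emoji Unicode ranges. The tweets are invented examples,
# not the article's actual data or analysis.
import re
from collections import Counter

# Covers symbols/pictographs, emoticons, transport, and regional
# indicators; not an exhaustive list of every emoji block.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "]"
)

tweets = [
    "Best episode so far 😭😭🎉",
    "Can't wait for next week 🎈",
    "That ending made me cry 😭",
]

counts = Counter(
    e for tweet in tweets for e in EMOJI_PATTERN.findall(tweet)
)
print(counts.most_common(3))  # e.g. [('😭', 3), ('🎉', 1), ('🎈', 1)]
```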
As a Ph.D. in Economics, I devoted myself to finding causal relationships among variables while finishing my dissertation. A causal relationship is powerful: it gives enough confidence to make decisions, prevent losses, find optimal solutions, and so forth. In this article, I will discuss what causality is, why we need to discover causal relationships, and common techniques for conducting causal inference.
A causal relationship describes a relationship between two variables such that one causes the other to occur. It is a much stronger relationship than correlation, which only describes the co-movement pattern between two variables. The correlation between two continuous variables can be easily observed by plotting a scatterplot. For categorical variables, we can plot bar charts to observe the relationship. To quantify the correlation between two continuous variables, we can use Pearson’s correlation formula. Pearson’s correlation is between -1 and 1, with a larger absolute value indicating a stronger correlation. …
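As a quick illustration, here is a short sketch computing Pearson's correlation with SciPy on simulated data; note that a strong correlation here says nothing about which variable causes the other.

```python
# Sketch: Pearson's correlation between two continuous variables,
# computed on simulated data. A high correlation does not imply
# any causal direction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # co-moves with x

r, p_value = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.2f} (p-value = {p_value:.3g})")
```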
In my previous articles, I have discussed the questions to prepare for data science interviews in machine learning, statistics, and probability theory. In this article, I will discuss how to prepare for case study questions.
During data science interviews, interviewers sometimes propose a series of business questions and discuss potential solutions using data science techniques. This is a typical example of a case study question. Based on the candidate’s performance, the interviewer can gain a thorough understanding of the candidate’s critical thinking, business intelligence, ability to solve vague business problems, and practical command of data science models and fundamentals. Unlike many other interview questions, most case study questions are open-ended, without a single correct answer. It is useful to know the patterns for answering these types of questions and structuring your answers. …
Web scraping means collecting data from websites, as long as the websites allow it. The workflow of web scraping includes not only getting the data online but also turning the data into something readable and usable, since in most cases the scraped data are unstructured. Specifically, the steps of web scraping are: …
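The article's step-by-step breakdown is not reproduced above, but as a minimal sketch of the overall workflow (request a page, parse the HTML, extract fields, structure the results), here is an example using requests and BeautifulSoup; the URL and CSS selectors are hypothetical placeholders.

```python
# Sketch of a typical scraping workflow: fetch a page, parse the
# HTML, extract fields, and structure the results. The URL and the
# CSS selectors are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://example.com/articles"  # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()           # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
records = []
for item in soup.select("div.article"):  # placeholder selector
    records.append({
        "title": item.select_one("h2").get_text(strip=True),
        "date": item.select_one("time").get_text(strip=True),
    })

df = pd.DataFrame(records)  # unstructured HTML -> structured table
print(df.head())
```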