In the Internet industry, e-commerce is where data analysis is used most heavily, and the major e-commerce platforms rely on data analysis to uncover opportunities for order growth. For example, a typical platform's core recommendation idea is to recommend products according to what a user browses every day and how long they stay, as well as how relevant those items are to past orders.
In this article, we will work through a real e-commerce data set. While reviewing the earlier content, you can also get a feel for the overall process of e-commerce data analysis.
Recently, an e-commerce company needed to plan a promotion campaign: send ads and offers to customers via text message to attract them to shop. However, due to budget and SMS limits, it is impossible to message every customer, so we need to find the people who are most likely to convert and send the promotional messages to them specifically.
Based on the above requirements, as data analysts we need to develop our own analysis plan. Our tasks are:
Through data analysis, find the characteristics of the group most likely to convert (such as age, gender, region, etc.). Through data analysis, determine the most suitable time to send the promotional SMS. Once the tasks are clear, we need to consider what data we need to support them, and go to the data department to request it.
After a round of "friendly" negotiation (read: a war of words) with the data department, we finally obtained the following data:
User behavior table: the users' behavior data (i.e., the order/behavior records) for the last 6 months.
VIP data: the users' VIP membership records.
User data: the users' personal information.
Once we have the data, we can get to work.
To make it easier to follow along, we simulate some relevant data ourselves; if you don't want to simulate it, you can contact *** to obtain it.
After getting the data, you can see the following files after decompression:
user_behavior_time_resampled.csv (user behavior data)
vip_users.csv (VIP user data)
user_info.csv (user data)
Let's first take a look at the field descriptions of each table:
user_behavior_time_resampled.csv: user_id, item_id, cat_id, seller_id, brand_id, time_stamp, timestamp, action_type
vip_users.csv: user_id, merchant_id, label
user_info.csv: user_id, age_range, gender
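If you do want to simulate the data yourself, a minimal sketch along these lines generates files with the same columns as above; the value ranges and row counts here are illustrative assumptions, not the real distributions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10000  # number of simulated behavior records (assumption)

# Simulated user behavior log with the same columns as user_behavior_time_resampled.csv
months = rng.integers(5, 12, n)             # May through November
days = rng.integers(1, 29, n)               # day of month
df_user_log = pd.DataFrame({
    "user_id": rng.integers(1, 1001, n),
    "item_id": rng.integers(1, 5001, n),
    "cat_id": rng.integers(1, 101, n),
    "seller_id": rng.integers(1, 501, n),
    "brand_id": rng.integers(1, 201, n),
    "time_stamp": months * 100 + days,      # MMDD-style date, e.g. 511 = May 11
    "timestamp": rng.uniform(0, 86400, n),  # seconds since midnight
    "action_type": rng.choice(["click", "cart", "order", "fav"], n),
})

# Simulated user info and VIP tables
df_user_info = pd.DataFrame({
    "user_id": np.arange(1, 1001),
    "age_range": rng.integers(0, 9, 1000),
    "gender": rng.choice([0, 1, 2], 1000),
})
df_vip_user = pd.DataFrame({
    "user_id": np.arange(1, 201),
    "merchant_id": rng.integers(1, 501, 200),
    "label": rng.choice([0, 1], 200),
})

# Assumes an ecomm/ directory already exists
df_user_log.to_csv("ecomm/user_behavior_time_resampled.csv", index=False)
df_user_info.to_csv("ecomm/user_info.csv", index=False)
df_vip_user.to_csv("ecomm/vip_users.csv", index=False)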
From here, we'll start reading data using some of the packages and libraries we've learned about earlier, starting with pandas to load the data.
import pandas as pd

# Load the three data files
df_user_log = pd.read_csv("ecomm/user_behavior_time_resampled.csv")
df_vip_user = pd.read_csv("ecomm/vip_users.csv")
df_user_info = pd.read_csv("ecomm/user_info.csv")

# Preview the loaded tables (in a notebook, each expression displays the DataFrame)
df_user_log
df_vip_user
df_user_info
Once loaded, the output looks like this:
Here, the df_user_log table contains both a time_stamp field and a timestamp field, and we need to figure out what each of them means. Let's take a look at the boundary values of these two fields.
time_stamp_max = str(df_user_log['time_stamp'].max())
time_stamp_min = str(df_user_log['time_stamp'].min())
print("time_stamp max: " + time_stamp_max, "time_stamp min: " + time_stamp_min)
timestamp_max = str(df_user_log['timestamp'].max())
timestamp_min = str(df_user_log['timestamp'].min())
print("timestamp max: " + timestamp_max, "timestamp min: " + timestamp_min)
The output is as follows:
time_stamp max: 1112, time_stamp min: 511
timestamp max: 86399.99327792758, timestamp min: 0.10787397733480476
As you can see, the maximum value of time_stamp is 1112 and the minimum is 511, while the maximum value of timestamp is about 86399.99 and the minimum is about 0.1.
From the description of the dataset, the user behavior table covers 6 months of user behavior, and a time_stamp ranging from 511 up to 1112 looks very much like a date: the minimum would represent May 11 and the maximum November 12.
So if time_stamp is a date, could timestamp be a specific time? The maximum value of timestamp is about 86399, while the number of seconds in a day is 24 * 3600 = 86400. The two numbers are so close that we can assume timestamp represents the second of the day at which the behavior occurred.
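To double-check this reading, a quick sketch like the one below splits the MMDD-style time_stamp into month and day and converts timestamp to hours; the expected values in the comments follow from the boundary values printed above.
# Sanity check: split time_stamp into month and day, and convert timestamp to hours
month = df_user_log["time_stamp"] // 100
day = df_user_log["time_stamp"] % 100
print(month.min(), month.max())               # expect 5 and 11
print(day.min(), day.max())                   # expect values between 1 and 31
print(df_user_log["timestamp"].max() / 3600)  # expect just under 24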
Having sorted out the two time fields, to avoid ambiguity we rename the time_stamp column to date.
df_user_log.rename(columns={"time_stamp": "date"}, inplace=True)
df_user_log
Once we've read the data and understood the meaning of each field, we can start cleaning the data.
For the datasets used in data analysis, we need to understand how complete the data is. Fields unrelated to our analysis can be left uncleaned, but if a key analysis dimension has missing values, we need to decide whether to fill them in or drop them outright.
Let's take a look at the missing values first:
df_user_log.isnull().sum()
The output is as follows:
user_id 0
item_id 0
cat_id 0
seller_id 0
brand_id 18132
date 0
action_type 0
timestamp 0
dtype: int64
Judging from the above results, more than 18,000 rows in the log table are missing brand data. The missing rate is quite low, about 0.16% (roughly 18 thousand out of 10.98 million rows); at this order of magnitude it will not affect the overall rigor of the analysis, so we will leave it alone for now.
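The missing rate itself can be computed directly, for example:
# Share of rows with a missing brand_id
missing_rate = df_user_log["brand_id"].isnull().sum() / len(df_user_log)
print("brand_id missing rate: {:.2%}".format(missing_rate))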
df_user_info.isnull().sum()
The output is as follows:
user_id 0
age_range 2217
gender 6436
dtype: int64
As the results show, the info table is missing 2,217 age values and 6,436 gender values. Since our analysis depends on users' age and gender and there is no reasonable way to fill in these values, we simply drop the rows with missing data.
df_user_info = df_user_info.dropna()
df_user_info
df_vip_user.isnull().sum()
Output:
user_id 0
merchant_id 0
label 0
dtype: int64
Judging from the results, the VIP table has no missing values and needs no processing.
After the above preparations, it is time to start our core data analysis work.
Remember our analysis tasks? The first is to identify the group of people the promotion should target, and the second is to determine when the promotional message should be sent. Let's tackle them one by one.
(1) User age analysis. Let's take a look at the age distribution using the DataFrame's value_counts function:
df_user_info.age_range.value_counts()
Output:
name: age_range, dtype: int64
Apart from the unknown values, we can see that the values 3 and 4 account for the largest share; they represent ages 25-29 and 30-34 respectively. We then calculate the proportion of users aged 25-34.
# Exclude users with unknown age (age_range == 0), then compute the share of ages 25-34
user_ages = df_user_info.loc[df_user_info["age_range"] != 0, "age_range"]
user_ages.loc[(user_ages == 3) | (user_ages == 4)].shape[0] / user_ages.shape[0]
Output:
It can be seen that the proportion of users in the 25-34 age range is 58%.
(2) User gender analysis. Next, we use the value_counts function to analyze gender:
df_user_info.gender.value_counts()
Output:
name: gender, dtype: int64
In terms of field meaning, 0 represents female, 1 represents male, and 2 represents unknown. From this it can be concluded that the platform's core user group is female, with roughly 2.35 times as many female users as male users.
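The ratio can be read straight off the same value_counts result; a small sketch, using the gender codes above:
# Ratio of female (0) to male (1) registered users
gender_counts = df_user_info.gender.value_counts()
print(gender_counts.loc[0] / gender_counts.loc[1])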
So far, through the analysis of user demographics, we can already conclude that the platform's core users are women aged 25-34. But is that realistic? After all, we have only analyzed registered user information, without combining it with the order data. Perhaps more women register, but fewer of them actually place orders. So the next step is to combine the user information with the order data to verify whether the conjecture holds.
As mentioned above, we need to combine the user information with the order information to check whether women really have stronger purchasing power. But the user data and the order data live in different tables, so what do we do? Looking at the data, both the user table and the order table have a user_id field, which gives us a way to join the two tables.
Associate two tables by user id:
df_user_log = df_user_log.join(df_user_info.set_index('user_id'), on = 'user_id')
df_user_log
Output:
As you can see from the above output, the age and gender columns from the user table have been merged into the order table. Next, we can analyze the gender and age of the users who actually placed orders.
df_user_log.loc[df_user_log["action_type"] == "order", ["age_range"]].age_range.value_counts()
Output:
name: age_range, dtype: int64
From the above results, the age distribution of orders is basically consistent with the user-information analysis, and the proportion of people aged 25-34 is 59.9%.
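That figure comes from the same kind of proportion calculation as before, this time restricted to order records; a sketch, excluding unknown ages as in the earlier calculation:
# Proportion of order records from users aged 25-34 (age_range 3 or 4)
order_mask = (
    (df_user_log["action_type"] == "order")
    & df_user_log["age_range"].notna()
    & (df_user_log["age_range"] != 0)
)
order_ages = df_user_log.loc[order_mask, "age_range"]
print(order_ages.isin([3, 4]).sum() / order_ages.shape[0])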
df_user_log.loc[df_user_log["action_type"] == "order", ["gender"]].gender.value_counts()
Output:
name: gender, dtype: int64
From the above results, it is still women who place more orders. At this point we can basically draw the conclusion: we should send the promotional messages to female users aged 25-34. We are now about halfway through the task; the group to receive the SMS has been determined, but we still need to decide when to send it. Let's move on.
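If you want to see age and gender together before settling on the target group, one option is a cross-tabulation of the order records; this is just a sketch, not a step from the original analysis:
# Cross-tabulate age_range and gender for order records only
orders = df_user_log[df_user_log["action_type"] == "order"]
print(pd.crosstab(orders["age_range"], orders["gender"]))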
Here we group the order dates into bins to see which period has the most orders. Since the data covers the last 6 months, we divide it into 6 bins and take a look:
df_user_log.loc[df_user_log["action_type"] == "order", ["date"]].date.value_counts(bins = 6)
Output:
name: date, dtype: int64
It can be seen that the largest number of orders were placed between October 11 and November 11. Having analyzed the date, let's look at which time of day has more orders.
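Since date is stored in MMDD form, another way to read this result is to count the orders per month, for example:
# Count order records per month using the MMDD-style date column
order_dates = df_user_log.loc[df_user_log["action_type"] == "order", "date"]
print((order_dates // 100).value_counts().sort_index())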
The timestamp field stores the time of each record as the number of seconds elapsed since midnight of that day. That is not very intuitive; we would rather analyze the data at the hour level. So let's create a new column, derived from the timestamp column, that represents the hour.
df_user_log["time_hours_view"] = df_user_log["timestamp"] / 3600
df_user_log
Output:
We can directly use value_counts on the newly added time_hours_view field to get the hour-level distribution within a day. We want to look at the distribution on a two-hour scale, so we divide it into 12 bins.
df_user_log.loc[df_user_log["action_type"] == "order", ["time_hours_view"]].time_hours_view.value_counts(bins = 12)
Output:
name: time_hours_view, dtype: int64
As can be seen from the above results, 8 p.m. to 10 p.m. is when the most orders are placed. At this point we have completed the analysis task. The promotional SMS should target female users aged 25-34, and the best time to send it is between 8 and 10 p.m., from late October to mid-November.
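As a final step, a sketch like the following could pull the user_ids matching the target profile from df_user_info, using the gender and age codes established above; whether the SMS list is built exactly this way is an assumption for illustration.
# Select female users (gender == 0) aged 25-34 (age_range 3 or 4) as the SMS target list
target_users = df_user_info[
    (df_user_info["gender"] == 0) & (df_user_info["age_range"].isin([3, 4]))
]["user_id"]
print(target_users.shape[0])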
If you have any questions or other needs, you can leave a message.