Learn how to build a machine learning pipeline that predicts new connections between users in a social network

Link prediction aims to predict the probability that a future or missing connection will form between nodes in a network. It is widely used in applications such as social networks, recommender systems, and biological networks. We will focus on link prediction in social networks, and we will use the same dataset we used for local link prediction with DGL in the previous post: the Twitch social network dataset. This dataset contains a graph whose nodes represent Twitch users and whose edges represent mutual friendships between users. We will use it to predict new ("follow") links between users, based on the existing links and user features.
As shown in the diagram, link prediction involves multiple steps, including importing and exporting data, preprocessing, training a model, tuning its hyperparameters, and finally deploying the inference endpoint that generates the actual predictions.
In this post, we focus on the first step of the process: preparing the data and loading it into the Neptune cluster.
Converting data into Neptune’s loader format
The original files provided with the dataset look like this:
Vertices (original):
id,days,mature,views,partner,new_id
73045350,1459,False,9528,False,2299
61573865,1629,True,3615,False,153
...
Edges (original):
from,to
6194,255
6194,980
...
To load this data into Neptune, we first need to convert it into one of the supported formats. We will use Gremlin, so the data must be in CSV files for vertices and edges, and the column names in those CSV files must follow the Gremlin load data format.
Here is what the converted data looks like:
Vertices (converted):
~id,~label,days:Int(single),mature:Bool(single),partner:Bool(single),views:Int(single)
2299,"user",1459,false,false,9528
153,"user",1629,true,false,3615
...
Edges (converted):
~from,~to,~label,~id
6194,255,"follows",0
255,6194,"follows",1
...
This is the code that converts the files provided with the dataset into the format supported by the Neptune Bulk Loader:
import pandas as pd
# === Vertices ===
# load vertices from the CSV file provided in the dataset
vertices_df = pd.read_csv('./musae_ENGB_target.csv')
# drop old ID column, we'll use the new IDs only
vertices_df.drop('id', axis=1, inplace=True)
# rename columns for Neptune Bulk Loader:
# add ~ to the id column,
# add data types and cardinality to vertex property columns
vertices_df.rename(
    columns={
        'new_id': '~id',
        'days': 'days:Int(single)',
        'mature': 'mature:Bool(single)',
        'views': 'views:Int(single)',
        'partner': 'partner:Bool(single)',
    },
    inplace=True,
)
# add vertex label column
vertices_df['~label'] = 'user'
# save vertices to a file, ignore the index column
vertices_df.to_csv('vertices.csv', index=False)
# === Edges ===
# load edges from the CSV file provided in the dataset
edges_df = pd.read_csv('./musae_ENGB_edges.csv')
# add reverse edges (the original edges represent mutual follows)
reverse_edges_df = edges_df[['to', 'from']].copy()  # copy to avoid mutating a view
reverse_edges_df.rename(columns={'from': 'to', 'to': 'from'}, inplace=True)
edges_df = pd.concat([edges_df, reverse_edges_df], ignore_index=True)
# rename columns according to Neptune Bulk Loader format:
# add ~ to 'from' and 'to' column names
edges_df.rename(
    columns={
        'from': '~from',
        'to': '~to',
    },
    inplace=True,
)
# add edge label column
edges_df['~label'] = 'follows'
# add edge IDs
edges_df['~id'] = range(len(edges_df))
# save edges to a file, ignore the index column
edges_df.to_csv('edges.csv', index=False)
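Before uploading, it can be worth sanity-checking the generated files. This is a short sketch that re-reads them and verifies the Bulk Loader headers and the edge symmetry we just created:
import pandas as pd

# Read the converted files back and check the Neptune Bulk Loader headers.
vertices = pd.read_csv('vertices.csv')
edges = pd.read_csv('edges.csv')

assert '~id' in vertices.columns and '~label' in vertices.columns
assert {'~from', '~to', '~label', '~id'}.issubset(edges.columns)

# The edge list should be symmetric, since we added a reverse edge
# for every original (mutual) follow.
pairs = set(zip(edges['~from'], edges['~to']))
assert all((t, f) in pairs for f, t in pairs)

print(len(vertices), 'vertices,', len(edges), 'edges')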
Allowing Neptune to access the data in S3: IAM role and VPC endpoint
After converting the files, we upload them to S3. To do this, we first need to create a bucket to hold our files. We also need to create an IAM role that allows access to the S3 bucket (in its attached policy) and has a trust policy that allows Neptune to assume it (see screenshot).
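These steps can also be scripted. Below is a minimal boto3 sketch, assuming placeholder bucket and role names and using the AWS-managed S3 read-only policy for brevity; Neptune assumes roles through the rds.amazonaws.com service principal:
import json
import boto3

s3 = boto3.client('s3')
iam = boto3.client('iam')

# Upload the converted files to S3 (bucket-name is a placeholder).
s3.upload_file('vertices.csv', 'bucket-name', 'vertices.csv')
s3.upload_file('edges.csv', 'bucket-name', 'edges.csv')

# Trust policy that lets Neptune assume the role;
# Neptune uses the rds.amazonaws.com service principal.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "rds.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName='neptune-loader-role',  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach a policy granting read access to the bucket; the broad
# managed read-only policy is used here only to keep the sketch short.
iam.attach_role_policy(
    RoleName='neptune-loader-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)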
We will add the role to our Neptune cluster (using the Neptune console), then wait until it becomes active (or restart the cluster).
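If you prefer scripting over the console, the Neptune management API exposes the same operation (a sketch; the cluster identifier and role ARN are placeholders):
import boto3

neptune = boto3.client('neptune')

# Associate the IAM role with the Neptune cluster; the role must
# show as active on the cluster before bulk loading will work.
neptune.add_role_to_db_cluster(
    DBClusterIdentifier='my-neptune-cluster',
    RoleArn='arn:aws:iam::account-id:role/neptune-loader-role',
)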
We also need to allow network traffic from Neptune to S3. To do so, we need a VPC gateway endpoint for S3 in our VPC.
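The gateway endpoint can be created from code as well (a sketch; the VPC ID, route table ID, and region are placeholders that must match your cluster's networking):
import boto3

ec2 = boto3.client('ec2')

# Create a gateway VPC endpoint so traffic from Neptune to S3
# stays inside the VPC (IDs and region are placeholders).
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0'],
)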
Loading the data
We are now ready to start loading our data. To do this, we call the Neptune Bulk Loader API from inside the VPC and create two load jobs: one for vertices.csv and another for edges.csv. The API calls are identical; only the S3 object key differs. The VPC configuration and security groups must allow traffic from the instance where you run curl to the Neptune cluster.
curl -X POST \
    -H 'Content-Type: application/json' \
    https://your-neptune-endpoint:8182/loader -d '
    {
      "source" : "s3://bucket-name/vertices.csv",
      "format" : "csv",
      "iamRoleArn" : "arn:aws:iam::account-id:role/role-name",
      "region" : "us-east-1",
      "failOnError" : "TRUE",
      "parallelism" : "HIGH",
      "updateSingleCardinalityProperties" : "FALSE"
    }'
The loader API responds with JSON containing the load job ID ("loadId"):
{
"status" : "200 OK",
"payload" : {
"loadId" : "your-load-id"
}
}
You can check whether the load has completed using this API:
curl -X GET https://your-neptune-endpoint:8182/loader/your-load-id
It responds with something like this:
{
"status" : "200 OK",
"payload" : {
"feedCount" : [
{
"LOAD_COMPLETED" : 1
}
],
"overallStatus" : {
"fullUri" : "s3://bucket-name/vertices.csv",
"runNumber" : 1,
"retryNumber" : 1,
"status" : "LOAD_COMPLETED",
"totalTimeSpent" : 8,
"startTime" : 1,
"totalRecords" : 35630,
"totalDuplicates" : 0,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 0
}
}
}
Once the vertices are loaded from vertices.csv, we can load the edges using the same API. To do this, we just replace vertices.csv with edges.csv in the first curl command and run it again.
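If you prefer to script the whole process instead of running curl by hand, both load jobs can be submitted and polled from Python. This is a sketch using the third-party requests library, with the same placeholder endpoint, bucket, and role ARN as above; loading sequentially guarantees the vertices exist before the edges reference them:
import time
import requests

LOADER = 'https://your-neptune-endpoint:8182/loader'  # placeholder endpoint

for key in ['vertices.csv', 'edges.csv']:
    # Submit the load job; only the S3 object key differs between the two calls.
    resp = requests.post(LOADER, json={
        'source': f's3://bucket-name/{key}',
        'format': 'csv',
        'iamRoleArn': 'arn:aws:iam::account-id:role/role-name',
        'region': 'us-east-1',
        'failOnError': 'TRUE',
        'parallelism': 'HIGH',
        'updateSingleCardinalityProperties': 'FALSE',
    })
    load_id = resp.json()['payload']['loadId']

    # Poll the status API until the job leaves the in-progress states.
    while True:
        status = requests.get(f'{LOADER}/{load_id}').json()
        overall = status['payload']['overallStatus']['status']
        if overall not in ('LOAD_NOT_STARTED', 'LOAD_IN_PROGRESS'):
            print(key, overall)
            break
        time.sleep(5)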
Check the loaded data
When the load jobs are finished, we can access the loaded data by sending Gremlin queries to the Neptune cluster. To run these queries, we can either connect to Neptune with a Gremlin console or use a Neptune/SageMaker notebook. We will use the SageMaker notebook, which can be created either together with the Neptune cluster or added later when the cluster is already running.
This is the query that gets the number of vertices we created:
%%gremlin
g.V().count()
You can also get a vertex by ID and check that its properties were loaded properly with:
%%gremlin
g.V('some-vertex-id').elementMap()
After loading the edges, you can check that they were loaded successfully with
%%gremlin
g.E().count()
and
%%gremlin
g.E('0').elementMap()
This concludes the data loading part of the process. In the next post, we will look at exporting data from Neptune in a format that can be used for ML model training.