CSE310 – Project #3 (Solution)

Description

In recent years there has been a great deal of interest in “small-world” graphs [5]. These types of graphs are characterized by a high degree of local clustering and a small number of long-range connections, making them very efficient at transferring information. This “small-world” architecture underlies the well-known “six degrees of separation” phenomenon, in which it is believed that a connection can be found between any two people on the planet requiring no more than six intermediate links.
In this project, we represent climatological data as a graph and compute some characteristics of it to determine if it is a small-world graph.
Note: This project is to be completed individually. Your implementation must use C/C++ and ultimately your code must run on the Linux machine general.asu.edu.
You must use a version control system as you develop your solution to this project, e.g., GitHub or similar. Your code repository must be private to prevent anyone from plagiarizing your work.
The rest of this project description is organized as follows. §1 describes the sea ice data set and how to represent it as a graph. §2 describes the types of analyses to be performed on such graphs. §3 describes the submission requirements for the milestone and the full project deadlines.
1 Arctic Sea Ice
Sea ice covers most of the Arctic Ocean and plays a significant role in the global water cycle and the global energy balance. It is also considered to be a sensitive indicator of climate change. Thus, any changes in the Earth’s climate are likely to first be seen in areas such as the high Arctic.
Because of its importance as a proxy indicator of climate change, a great deal of research is conducted on Arctic sea ice. Data acquired by meteorological satellites provides one of the most effective ways to study large-scale changes in sea ice conditions in the Arctic. The longest continuous satellite record of sea ice comes from the Nimbus-7 Scanning Multi-channel Microwave Radiometer (SMMR) and Defense Meteorological Satellite Program Special Sensor Microwave/Imager (DMSP SSM/I) series of meteorological satellites. Data acquisition started in late 1978, with the first full year of data in 1979, and is ongoing. This data is maintained for the Arctic and Antarctic by the National Snow & Ice Data Center (NSIDC).
1.1 Sea Ice Concentration Anomaly Data Set
The sea ice concentration (SIC) anomaly data set that we will use consists of 27 years (1979–2005) of weekly SIC anomaly data derived from the SMMR-SSM/I passive microwave data set. An anomaly data set is when the long-term average is subtracted from the data, to remove seasonal trends, making the data more amenable to statistical analysis.
The data for each week is for a 304×448 floating point array representing the northern hemisphere. The data value at each cell (x,y) in the array represents the percentage of deviation in ice concentration from the 27-year average for a given week. For example looking at the values of the array for week 30 of 1990, at cell (100,200), the value is -4.5. This means that at cell (100,200) for week 30 of 1990, the sea ice concentration was 4.5% lower than the 27-year average value for week 30 for that cell.
Since there are 52 weeks per year and 27 years of data, there are 52 × 27 = 1,404 sea ice concentration readings for each position (x,y) over the years. Therefore, for each of the 304 × 448 = 136,192 positions there is a time series [x,y,t], 1 ≤ t ≤ 1,404, of SIC data with 1,404 values, starting at week 1 of 1979.
The data set is given as 1,404 files each containing a 304 × 448 32-bit floating point array (little-endian byte order). The format of the filenames is: diffwNNyYYYY+landmask, where wNN denotes week NN and yYYYY denotes the year. For example, diffw31y1983+landmask is the file for week 31 of 1983. The “+landmask” part of the name indicates that a landmask was applied to the data.

(a) Full Arctic region (b) Beaufort Sea subregion
Figure 1: Sample SSM/I total sea ice concentration image.
Figure 1(b) shows a subregion near the Beaufort Sea (Ellesmere Island is on the lower right.) The corresponding data set has only 63 × 63 = 3969 cells for each of the 27 years. This smaller data set is otherwise identical to the full data set and should be used for code development and testing.
1.2 A Graph Representation for the Climatological Data Set
How is this type of climatological data represented as a graph? Tsonis et al. derived a correlation-based graph G = (V,E) [1, 4]. The vertex set V corresponds to the cells, i.e., for the full SIC data set the graph has 302 × 448 vertices. To determine the edge set, the Pearson correlation coefficient (see §1.2.1) is calculated between all pairs of cells (x,y) and (x0,y0), 1 ≤ x,x0 ≤ 304, 1 ≤ y,y0 ≤ 448, (x,y) 6= (x0,y0), of time series vectors. That is, the correlation coefficient is computed between [x,y,t] and [x0,y0,t], 1 ≤ t ≤ 1,404, for each pair of distinct cells.
Given that there are n = 136,192 cells and there are n(n−1)/2 pairs of cells, the correlation coefficient is calculated for 3,547,116 pairs of time series. If the correlation coefficient for a pair of cells (x,y) and (x0,y0) of time series, i.e., [x,y,t] and [x0,y0,t], 1 ≤ t ≤ 1,404, is greater than some threshold rthresh, then an edge is inserted between cells (x,y) and (x0,y0). The final result is a graph with edges between all cells having a correlation greater than the threshold rthresh.
1.2.1 Pearson Correlation Coefficient
To get a measure of how strongly two vectors X and Y , of length n, are related, we use the correlation coefficient. The Pearson correlation coefficient measures the strength and direction of a linear relationship between X and Y . The formula for the sample correlation coefficient, denoted by r is:

where,
, and .
In our case the vector X corresponds to the time series of length 1,404 for cell (x,y), whereas the vector Y corresponds to the time series of the same length for cell (x0,y0). You will need to compute the sample correlation coefficient between all pairs of distinct cells (x,y) and (x0,y0). For a given pairs of cells (x,y) and (x0,y0), if |r| ≥ rthresh then you should insert an edge in the graph between the vertex corresponding to (x,y) and the vertex corresponding to (x0,y0). (The absolute value |r| of the correlation coefficient should be used, since r can be positive or negative.)
2 Analyses of the Climatological Graph
We are interested in whether the graph derived from the SIC climatological data is a small-world graph. First, construct a correlation-based graph Gr = (Vr,Er) for the sea ice anomaly dataset for each correlation threshold rthresh ∈ {0.95,0.925,0.9}. Use an adjacency list of represent the graph. (Start with the largest threshold as it will yield the sparsest graph.)
Now, for each correlation-based graph Gr:
1. Plot the histogram of the degree distribution of Gr. If |Vr| = n vertices, the degree of a vertex can range from zero (the vertex is isolated) to n − 1 (the vertex has an edge to every other vertex in the graph). A histogram of the degree distribution for Gr therefore plots a count of the number of vertices of each degree d, 0 ≤ d ≤ n − 1.
Recall that a small-world graph is characterized by a high degree of local clustering and a small number of long-range connections. As a result, we expect degree distribution of a small-world graph to have a long-tailed distribution.
For a small-world graph, what do you think the component structure should look like?
3. Another way to determine if the graph Gr is a small-world graph is to calculate the mean clustering coefficient γ(G) and the characteristic path length (or diameter), L(G), of Gr and compare it to a random graph Grandom of the same size (i.e., with the same number of vertices). (These characteristics are defined in §2.1 and in §2.2, respectively.) For a small-world graph, ) and
L(Gr) ≥ L(Grandom) [3, 5].
Compute the clustering coefficient, γ(Gr), and the characteristic path length, L(Gr), of the graph Gr.
4. Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom, of comparable size. (See §2.3 for details.)
2.1 Clustering Coefficient
The neighbourhood N(v) of a vertex v consists of all the vertices adjacent to v. The graph generated by N(v), hN(v)i has vertex set N(v) and its edges are all edges of the graph with both endpoints in N(v). Use k(v) and e(v) to denote the numbers of vertices and edges in hN(v)i. The clustering coefficient γv of v is:
.
In words, γv for a vertex v is the proportion of edges between the vertices within its neighbourhood divided by the number of edges that could possibly exist between them.
The clustering coefficient of a graph G is the mean of the clustering coefficient of all vertices of G and is denoted γ(G).
2.2 Characteristic Path Length
Let di,j be the length of the shortest path between the vertices i and j. Then the characteristic path length L(G) for the graph G = (V,E), is di,j averaged over all pairs of vertices, where n = |V |. Use the
Floyd-Warshall all-pairs shortest paths algorithm.
2.3 Metrics for Random Graphs
A random graph Grandom corresponding to Gr has the same number of vertices as Gr, namely n = |Vr|. However, edges are inserted into Grandom at random such that the edge density of Grandom matches the edge density of Gr, i.e., the probability p that edges are present in Grandom should match that of Gr. This edge probability p is calculated from Gr having n vertices and mean vertex degree k as:
.
This calculation is derived intuitively by reasoning that the fraction of actual edges out of the total number of possible edges should approximate the edge probability of the graph.
The clustering coefficient of a random graph . Similarly, the characteristic path length of a random graph . Here, n is the total number of vertices in the correlation graph Gr, and k is the mean vertex degree of Gr.
Given the degree distribution of Gr it is straightforward to compute k.
An unlimited number of submissions are allowed. The last submission will be graded.
Using the submission link on Canvas for the Project #3 milestone, a zip file named yourFirstName-yourLastName.zip containing the following:
Project State (5%): In a folder named State provide a brief report that addresses the following:
1. Explain all design decisions. Discuss your representation of the graph, and any optimizations you made in computations. (Depending on the order in which you calculate the statistics, you can likely save and make use of previous results to cut down on the computation time.) Can you compute a worst-case bound on the time and/or space of your algorithms?
2. Describe any problems encountered in your implementation for this project milestone.
3. Describe any known bugs in your project milestone.
5. Cite any external code bases, books, and/or websites used or referenced.
Implementation (50%): In a folder named Code provide:
1. In one or more files, your well documented C/C++ source code implementing this project milestone.
Report (20%): In a folder named Report provide a report containing the following:
1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh ∈ {0.95,0.925,0.9}, plot the degree distribution.
(a) Do you think the degree distribution is consistent with that of a small-world graph? Why or why not?
(b) Identify any supernodes, i.e., vertices with significantly higher vertex degree than the average, and where they occur. Describe your determination of supernode.
2. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh ∈ {0.95,0.925,0.9}, compute the number of connected components in Gr and their size (i.e., number of vertices).
(a) For a small-world graph, how do you think the component structure should look?
(b) Do your results support your hypothesis?
Correctness (25%): The correctness of your program will be determined by running your program on the “small” data set for a given threshold.
The milestone is worth 30% of the total project grade.
Using the submission link on Canvas for the complete Project #3, a zip file named yourFirstName-yourLastName.zip containing the following:
Project State (5%): Follow the same instructions for Project State as in §3.1.
Implementation (40%): Follow the same instructions for Implementation as in §3.1.
1. For the “small” data set, i.e., the subregion in the Beaufort Sea, for each correlation threshold rthresh ∈ {0.95,0.925,0.9}, compute the clustering coefficient, γ(Gr), and the characteristic path length, L(Gr), of the graph Gr.
2. Compute the clustering coefficient, γ(Grandom), and the characteristic path length, L(Grandom), of a random graph Grandom corresponding to each Gr.
(a) Compare γ(Gr) and L(Gr) to γ(Grandom) and L(Grandom) for the random graph, Grandom, of comparable size.
(b) What conclusion(s) can you draw from your results?
Correctness (30%): The same instructions for Correctness as in §3.1 apply.
Bonus (10%): Repeat the analyses on the full data set. (I know that this is a lot of extra work for a bonus; it is a good indicator of the scalability of your code.)
Acknowledgements
Thanks to my former student Wayne S. Chan who motivated this project.
References
[4] M. Tumminello, D. Matteo, T. Aste, and R. N. Mantegna. Correlation based networks of equity returns sampled at different time horizons. European Physical Journal B, 55:209–217, 2007.
[5] D. J. Watts and S. H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, 1998.

Reviews

There are no reviews yet.

Be the first to review “CSE310 – Project #3 (Solution)”

CSE310 – Project #3 (Solution)

Description

Reviews

Related products

CSE310 – Department of Computer Science & Engineering Solved

CSE310 – Assignment 2 Solved

CSE310 – DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Solved

CSE310 – Department of Computer Science & Engineering Solved

CSE310 – Project #1 (Solution)