CHAPTER 5. RESULTS AND ANALYSIS
point FFT and 32 point FFT are 24.14 µs and 3.02 µs, respectively. From
Section 4.3 we know that the 1024 point FFT process takes 21.185 µs and the
32 point FFT process takes 1.14 µs. Consequently, the time spent on DMA
transfers for these processes can be found by subtraction: 2.955 µs (24.14 -
21.185) for the AXI DMA block connected to the 1024 point FFT and 1.88 µs
(3.02 - 1.14) for the AXI DMA block connected to the 32 point FFT.
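The subtraction can be reproduced as a quick sanity check; all values are in microseconds and taken from this section and Section 4.3:

```python
# DMA transfer time = total round-trip time minus pure FFT processing time.
# All values in microseconds (Sections 4.3 and 5.1).
total_1024, fft_1024 = 24.14, 21.185   # 1024 point FFT: total vs. compute
total_32, fft_32 = 3.02, 1.14          # 32 point FFT: total vs. compute

dma_1024 = round(total_1024 - fft_1024, 3)  # DMA share, 1024 point path
dma_32 = round(total_32 - fft_32, 3)        # DMA share, 32 point path
print(dma_1024, dma_32)                     # 2.955 1.88
```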
Process                        Time (clock cycles)    Time (µs)
FFT - 1024 pt.                        2414               24.14
FFT - 32 pt.                           302                3.02
First FFT - Total (x384)            925164             9251.64
Second FFT - Total (x6144)         1825811            18258.11
Third FFT - Total (x16384)         4868601            48686.01
First transpose                   22935952           229359.52
Second transpose                  23207150           232071.50

Table 5.3: Timing results of the implementation
It is clear that the second FFT does not meet the requirement found in
Chapter 3. Based on the DMA transfer time we can derive a constraint on
the processing time of this FFT: the difference between the time allowed
for the second FFT process and the DMA transfer, i.e. 0.34 µs (2.22 - 1.88).
Hence, the processing time for this FFT should be at most 0.34 µs. However,
according to the Core Generator tool, the latency of the pipelined 32 point
floating-point FFT at the maximum achievable frequency is 0.415 µs. Thus,
the desired processing time cannot be achieved with a single hardware block,
and another 32 point FFT hardware block must be added in order to meet the
requirements under the current implementation conditions.
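The budget argument can be restated numerically. The sketch below uses the 2.22 µs requirement from Chapter 3 and, for the two-block case, assumes perfect overlap between the blocks, which is an idealization:

```python
required = 2.22    # µs allowed per 32 point FFT (requirement from Chapter 3)
dma = 1.88         # µs spent on the AXI DMA transfer
latency = 0.415    # µs, pipelined 32 point floating-point FFT (Core Generator)

budget = round(required - dma, 2)  # time left for the FFT computation itself
print(budget)                      # 0.34
print(latency <= budget)           # False: one block cannot keep up

# With two FFT blocks working in parallel, each transform effectively gets
# twice the budget (assuming perfect overlap, which is an idealization).
print(latency <= 2 * budget)       # True
```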
Another conclusion that can be drawn from the table is that the memory
transpose times are very high and add large delays to the signal processing
time. Together, the two transpose processes take around 0.46 s, which is far
higher than the FFT processing times. This hinders real-time performance
and imposes a minimum interval of 0.46 s between two consecutive radar
scans.
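The 0.46 s figure follows directly from Table 5.3:

```python
# Transpose times from Table 5.3, in microseconds.
first_transpose, second_transpose = 229359.52, 232071.50
total_seconds = (first_transpose + second_transpose) / 1e6
print(round(total_seconds, 2))  # 0.46
```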
To summarize, due to the DMA transfer times, the number of FFT hardware
blocks for the second and the third FFT should be increased in order to
meet the real-time requirements. In addition, the memory transpose time is
very high; it adds a time constraint between two consecutive radar scans
and therefore prevents the real-time requirements from being met.
5.2 Analysis
In the previous section we identified the bottleneck of the implementation.
This section evaluates the model and presents techniques that can be applied
to reduce the effect of the bottleneck.
5.2.1 Evaluation
Based on the results from the previous section we can conclude that full
real-time performance of the architecture is not achievable. The main reason
is the time required for the memory transpose operations, which is the main
bottleneck of the implementation (see Figure 5.1). Given this bottleneck,
the processing time requirements for the second and the third FFT processes
can be relaxed; in fact, no additional hardware blocks are required for the
current implementation of the architecture. However, consecutive radar
scans must be separated by a specified interval, which is 0.46 s based on
Table 5.3.
Figure 5.1: Processes and their performance
Moreover, it should be noted that precise real-time guarantees cannot be
given for the architecture. This is because the implementation requires
frequent access to the SDRAM, whose access latency is not constant and can
vary for a number of reasons, such as page misses and refresh operations
inside the SDRAM. This uncertainty could be reduced by knowing the exact
scheduling scheme of the SDRAM controller; however, the Xilinx
documentation of the Memory Interface Controller does not provide any
information about the scheduling scheme of the provided core.
A few methods have been tested to reduce the time needed for the memory
transpose operations. The first was to use different memory banks for the
transposed data and for the data to be transposed. The intuition was to
separate the SDRAM read and write operations, thereby reducing the time
spent on page openings and closings. This was expected to shorten the
memory write operations, since one column of the transposed data could be
written to the memory without any further page openings. The performed
test confirmed this: the first transpose operation in this case takes
22849682 clock cycles, which is 86270 clock cycles fewer than before.
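The effect of the bank separation can be illustrated with a toy open-row model. The row size, element size, and matrix shape below are illustrative assumptions and do not reflect the actual DDR3 configuration:

```python
# Toy model: each bank keeps one DRAM row (page) open; touching an address
# in a closed row costs an activation. A transpose interleaves reads
# (row-major) with writes (column-major), so in a single bank the write
# stream keeps evicting the row the read stream just opened, and vice versa.
ROW_BYTES = 1024   # assumed DRAM page size (illustrative)
ELEM = 4           # assumed bytes per sample (illustrative)
R, C = 64, 64      # illustrative matrix shape

def activations(bank_of):
    open_row = {}  # bank -> currently open row
    acts = 0
    for i in range(R):
        for j in range(C):
            for addr, bank in (((i * C + j) * ELEM, bank_of("rd")),   # read
                               ((j * R + i) * ELEM, bank_of("wr"))):  # write
                row = addr // ROW_BYTES
                if open_row.get(bank) != row:   # page miss: activate row
                    open_row[bank] = row
                    acts += 1
    return acts

one_bank = activations(lambda op: 0)                       # shared bank
two_bank = activations(lambda op: 0 if op == "rd" else 1)  # separate banks
print(one_bank > two_bank)  # True: separate banks need fewer activations
```

The model only counts row activations, yet it reproduces the qualitative result: separating the read and write streams into different banks avoids most of the page thrashing.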
The second method tested on the architecture was to add a block RAM module
and use it as a staging location for the memory transpose operations. The
intuition was that the access latency of the block RAM is considerably
lower than that of the SDRAM, and that no page opening and closing are
required for accessing the data in a random order. The memory transpose
operation would thus consist of fetching the data into the block RAM,
transposing it there, and writing it back to the SDRAM. Consequently, the
architecture was extended with an AXI BRAM Controller [22] module to
provide the AXI transfers to be read from and written to the block RAM.
However, the test showed that this method is not effective: the first
memory transpose time grew to 43408097 clock cycles, almost twice the
former value.
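In software terms, the evaluated scheme amounts to a tiled transpose: fetch a tile into fast local storage (the block RAM), transpose it there, and write it back to SDRAM. A minimal sketch, with an illustrative tile size:

```python
# Sketch of the evaluated BRAM-assisted transpose: data moves from SDRAM
# into block RAM tile by tile, is transposed locally, and is written back.
TILE = 8  # illustrative; the real choice depends on BRAM capacity

def transpose_tiled(matrix):
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            # "fetch" a TILE x TILE tile into local storage (the BRAM)
            tile = [row[bj:bj + TILE] for row in matrix[bi:bi + TILE]]
            # transpose locally and "write back" to the SDRAM image
            for i in range(len(tile)):
                for j in range(len(tile[0])):
                    out[bj + j][bi + i] = tile[i][j]
    return out

m = [[i * 16 + j for j in range(16)] for i in range(16)]
assert transpose_tiled(m) == [list(col) for col in zip(*m)]
```

One plausible explanation for the measured slowdown is the extra trip each sample makes over the AXI interconnect into and out of the block RAM.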
Another method, described in [15], is to store the FFT outputs in multiple
banks so that, while reading the data back, the wait cycles for opening and
closing a row can be hidden. That is, by interleaving the banks for memory
accesses, the data can be accessed immediately without any time spent
waiting. This would reduce the time needed for read operations and, in
addition, would require no transpose operation at all. However, this method
has a major limitation due to the DMA transfer characteristics: it would
require using an AXI DMA core to transfer a single data sample, which is
inefficient in terms of bandwidth since it does not take advantage of burst
transfers. This can be observed in the timing results above, which show
that transferring 16 KByte (8 KByte on the MM2S channel and 8 KByte on the
S2MM channel) of data using the AXI DMA block takes 2.955 µs, whereas
transferring 512 Byte (256 Byte on MM2S and 256 Byte on S2MM) takes
1.88 µs. We can conclude that the larger the transfer, the higher its
effective bandwidth.
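The bandwidth comparison can be made explicit, using the DMA timings derived at the start of this chapter:

```python
# Effective DMA bandwidth = payload bytes / transfer time (MM2S + S2MM).
big_bw = (16 * 1024) / 2.955   # 16 KByte in 2.955 µs, in bytes per µs
small_bw = 512 / 1.88          # 512 Byte in 1.88 µs, in bytes per µs
print(big_bw / small_bw)       # the large burst is roughly 20x more efficient
```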
To sum up, we have described a few methods that were tested to reduce the
memory transpose time, the main bottleneck of the implementation. These
methods did not yield considerable improvements in operation time. The
two-bank implementation of the transpose operation decreases the time
required, but the change is not significant compared to the complete
transpose time. The other two methods described increase the transpose time
and provide no benefit. Therefore, we can conclude that the transpose
operation using the MicroBlaze is the major limitation of the
implementation and has little room for further improvement.
Chapter 6
Conclusion
The initial aim of this thesis was to analyze the MIMO FMCW signal
processing scheme and extend the Starburst platform based on the real-time
requirements of the automotive radar application. However, it was found
that the Starburst platform is not suitable for this application and does
not provide any means to meet the real-time performance requirements.
Therefore, an alternative architecture was proposed based on the signal
processing algorithm and implemented on a Virtex-6 FPGA. This chapter
describes the conclusions drawn during this process and the future work
that can be carried out to improve the architecture.
6.1 Conclusions
There were three main findings from the analysis of the MIMO FMCW signal
processing algorithm. First, the Fast Fourier Transform is the core process
of the algorithm. Second, the algorithm requires a huge amount of
intermediate data to be stored in memory. Third, a memory transpose
operation might be required for efficient memory access. In addition, the
algorithm had to meet certain constraints to provide real-time performance.
These constraints were based on the parameters used in the Matlab model
provided by NXP Semiconductors.
The initial design idea for the implementation was based on the Starburst
platform. The idea was to use a MicroBlaze core, the main processing
element of the Starburst platform, to run the FFT algorithm. However, it
was shown theoretically that an FFT algorithm running on the MicroBlaze
core could not meet the processing time constraints, so real-time
performance could not be achieved with this implementation.
To meet the constraints, an FPGA implementation of the algorithm was
proposed. The MicroBlaze core was used as the main unit in the architecture
to generate the input signal to be processed and to control all the other
operations, such as transposing the memory and configuring the AXI DMA
cores. For the FFT operations, Xilinx’s FFT IP was found to be suitable for
the architecture: it provides a number of architecture options for the FFT
implementation and meets the real-time constraints required by the
application.
Moreover, the on-chip memory provided by the Virtex-6 FPGA was found to be
insufficient for storing the intermediate data. Thus, it was decided to use
the off-chip SDRAM memory for this purpose. This choice requires a
transpose operation to be performed in the memory, which, as discovered
later, is the main bottleneck of the implementation.
The memory transpose process is the main area in which major improvements
need to be made. A few methods have already been evaluated, but no solution
was found that meets the real-time requirements.
It was also found that the accuracy of the range, velocity and bearing
results is limited by the resolution: the accuracy of an output is half of
its resolution. The range resolution of the radar is 0.15 m, which means
that the accuracy of the range output will be ≤ 0.075 m. The velocity
resolution of the radar is 1.6668 m/s, which means the accuracy of the
velocity output will be ≤ 0.8334 m/s. As mentioned in Chapter 5, the
angular resolution of the radar is not constant and changes based on the
beam width.
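The quoted accuracies follow from halving the resolutions:

```python
range_resolution = 0.15        # m (Chapter 5)
velocity_resolution = 1.6668   # m/s

# Accuracy is bounded by half the resolution in each dimension.
print(range_resolution / 2)     # 0.075 (m)
print(velocity_resolution / 2)  # 0.8334 (m/s)
```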
In addition, it should be noted that the current implementation uses
single-precision floating-point representation for the signal processing.
The main reason is that the MicroBlaze core can generate single-precision
floating-point input data, which offers more precision than fixed-point.
Furthermore, the FFT IP core also supports single-precision floating-point
processing, which makes it easier to integrate into the architecture.
However, the specific advantages of floating-point over fixed-point are not
clear, and further research should be carried out on the trade-offs between
floating-point and fixed-point implementations.
6.2 Future Work
Further research and work should be carried out on the FPGA implementation
of the algorithm to improve its performance. As mentioned in the previous
section, the main effort should concentrate on reducing the time needed for
the transpose operations. It should be noted that the current analysis and
implementation assumed that consecutive radar scans happen without any time
interval between them, so that at any moment the driver of the vehicle has
information about the surrounding environment. However, the necessity of
this condition is not clear from the provided model. Therefore, it should
be investigated whether SDRAM storage of the intermediate data is really a
requirement when there is enough time between consecutive radar scans.
According to the calculations, if there is enough time after the first
radar scan and 16 bit fixed-point arithmetic is used, the memory inside the
Virtex-6 might be sufficient to store the intermediate data of the first
FFT processing. In addition, in-place computation can be used to store the
data of the second FFT processing without any additional memory storage.
This would require the design of a new DMA core, since the one used in the
current implementation has very low bandwidth for single transfers.
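The claim that the on-chip memory might suffice can be checked with back-of-the-envelope arithmetic. The sample counts come from Table 5.3; the block RAM capacity assumes the XC6VLX240T part on the ML605 board [8] and should be verified against the data sheet:

```python
# First FFT stage: 384 transforms of 1024 points each (Table 5.3).
samples = 384 * 1024
# 16 bit fixed point, complex (real + imaginary): 4 bytes per sample.
bytes_needed = samples * 2 * 2

# Assumed on-chip capacity: ~14,976 Kbit of block RAM on the XC6VLX240T
# (an assumption; consult the Virtex-6 data sheet).
bram_bytes = 14976 * 1024 // 8

print(bytes_needed)                # 1572864 bytes, i.e. 1.5 MiB
print(bytes_needed <= bram_bytes)  # True: it might fit on chip
```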
Moreover, the current design uses pre-existing IP blocks developed by
Xilinx, which provide no information about their internal design.
Therefore, it is difficult to perform a data-flow analysis on the model and
provide real-time guarantees. In the future, some of the blocks, such as
the DDR controller, could be redesigned based on the specific memory access
patterns of the application. This would make it possible to perform a
data-flow analysis on the model and to estimate the performance of the
implementation beforehand.
Additionally, we saw in Section 5.1 that the FFT blocks take a lot of
hardware resources. An improvement can be made here by redesigning the FFT
core to use fewer hardware resources while still providing sufficient
latency and bandwidth.
Bibliography
[1] “Phase-comparison monopulse.” https://en.wikipedia.org/wiki/Phase-comparison_monopulse. Accessed: 2016-07-23.
[2] M. Skolnik, Radar-Handbook. McGraw-Hill Professional, 2008.
[3] J. Hasch, E. Topak, R. Schnabel, T. Zwick, R. Weigel, and C. Waldschmidt, “Millimeter-Wave Technology for Automotive Radar Sensors in the 77 GHz Frequency Band,” IEEE Transactions on Microwave Theory and Techniques, vol. 60, pp. 845–860, 2012.
[4] B. Dekens, Low-Cost Heterogeneous Embedded Multiprocessor Architecture for Real-Time Stream Processing Applications. PhD thesis, University of Twente, October 2015.
[5] V. Issakov, Microwave Circuits for 24 GHz Automotive Radar in Silicon-
based Technologies. Springer, 2010.
[6] I. V. Komarov and S. M. Smolskiy, Fundamentals of Short-Range FM
Radar. McGraw-Hill Professional, 2005.
[7] V. Winkler, “Range Doppler Detection for Automotive FMCW Radars,”
Proceedings of the 4th European Radar Conference, 2007.
[8] Xilinx, “ML605 Hardware User Guide,” 2012. UG534(v1.8).
[9] W. Wiesbeck, Radar Systems Engineering. Karlsruhe Institute of Technology, 2009. Lecture Script.
[10] R. Feger, C. Wagner, S. Schuster, S. Scheiblhofer, H. Jäger, and A. Stelzer, “A 77-GHz FMCW MIMO Radar Based on an SiGe Single-Chip Transceiver,” IEEE Transactions on Microwave Theory and Techniques, vol. 57, pp. 1020–1035, April 2009.
[11] J. Li and P. Stoica, “MIMO Radar with Collocated Antennas,” IEEE
Signal Processing Magazine, vol. 24, pp. 106 – 114, September 2007.
[12] Y. Qu, G. S. Liao, S. Q. Zhu, X. Y. Liu, and H. Jiang, “Performance Analysis of Beamforming for MIMO Radar,” Progress In Electromagnetics Research, pp. 123–134, 2008.
[13] Xilinx, “MicroBlaze Processor Reference Guide,” 2013. UG081(v14.7).
[14] W. Zong-bo, J. C. Moya, A. B. del Campo, J. G. Menoyo, and G. Mei-guo, “Range-Doppler Image Processing in Linear FMCW Radar and FPGA Based Implementation,” Journal of Communication and Computer, vol. 6, no. 53, pp. 55–62, 2009.
[15] F. Meinl, E. Schubert, M. Kunert, and H. Blume, “Realtime FPGA-based Processing Unit for a High-Resolution Automotive MIMO Radar Platform,” European Radar Conference, pp. 213–216, 2015.
[16] E. Hyun, S.-D. Kim, D.-J. Yeom, and J.-H. Lee, “Parallel and Pipelined
Hardware Implementation of Radar Signal Processing for an FMCW
Multi-channel Radar,” Elektronika IR Elektrotechnika, vol. 21, no. 2,
2015.
[17] W. Wang, D. Liang, Z. Wang, H. Yu, and Q. Liu, “Design and Implementation of a FPGA and DSP Based MIMO Radar Imaging System,” Radioengineering, vol. 24, pp. 518–527, 2015.
[18] Xilinx, “LogiCORE IP Fast Fourier Transform v8.0,” 2012.
[19] Xilinx, “LogiCORE IP AXI DMA v6.03,” 2012.
[20] Xilinx, “Virtex-6 FPGA Memory Interface Solutions,” 2013. UG406.
[21] Xilinx, “LogiCORE IP AXI TIMER v1.03,” 2012.
[22] Xilinx, “LogiCORE IP AXI Block RAM (BRAM) Controller v1.03,”
2012.